Preface


These are the proceedings of the 23rd annual event of IDEAS. We find the challenge of holding a
quality conference increasing with the larger number of conferences that are either underwritten by
"not-for-profit" organizations with all invited papers or have papers guaranteed to be accepted. It is
heartening to learn that, in spite of these challenges, we received 81 papers this year with new
approaches and ideas. This allows us to continue to be selective. This meeting highlights the current
preoccupation with big data, blockchain, data analytics and the issue of personal data on the web
platform; this is reflected in the accepted papers in these proceedings.


We would like to take this opportunity to thank the local and publicity chairs and the program
committee for their help in the review process. All the submitted papers were assigned to four
reviewers, and we got back over 2.6 reviews per paper on average due to the shorter review periods.
The proceedings consist of 31 full papers (38%), 8 short papers (16%) and 5 poster papers (11%).


We are honoured to have two excellent keynote speakers: Schahram Dustdar (Technical University of
Vienna) and Foto Afrati (National Technical University of Athens). The abstract and the invited paper,
respectively, of these talks are included in the proceedings.


Acknowledgment: This conference would not have been possible without the help and effort of many 
people and organizations. Thanks are owed to: 


- ACM (Anna Lacson, Craig Rodkin, and Barbara Ryan), 

- BytePress, ConfSys.org, Concordia University (Nathalie Blair, Kunsheng Zhao, Ming Lu, Gerry 
Laval, and Will Knight), 

- George Dimitrakopoulos, Dimitris Michail along with the support staff at Harokopio University of Athens 
who contributed selflessly and were involved in organizing and supporting the local events. 


We appreciate their efforts and dedication. 


Bipin C. Desai, Dimosthenis Anagnostopoulos, Mara Nikolaidou, Yannis Manolopoulos
Novel Paradigms for engineering large-scale resilient IoT Systems


Schahram Dustdar 


Distributed Systems Group, TU Wien, Austria 
dustdar@dsg.tuwien.ac.at 


Abstract— This invited talk explores the research 
challenges in the domain of IoT from multiple 
angles and reflects on the urgently needed collective 
efforts from various research communities to 
collaborate on those. Our approach fundamentally 
challenges the current understanding of scientific, 
technological, and political paradigms in tackling 
the engineering of large-scale IoT systems. We 
discuss technical paradigms and research challenges 
in the domains of Cloud and Edge Computing as 
well as the requirements of people in such systems 
embedded in Smart Cities. 


"Permission to make digital or hard copies of part or all of this work for 
personal or classroom use is granted without fee provided that copies are not 
made or distributed for profit or commercial advantage and that copies bear 
this notice and the full citation on the first page. Copyrights for components 
of this work owned by others than ACM must be honoured. Abstracting with 
credit is permitted. To copy otherwise, to republish, to post on servers or to 
redistribute to lists, requires prior specific permission and/or a fee. Request 
permissions from Permissions@acm.org. 

IDEAS 2019, June 10-12, 2019, Athens, Greece 


CCS Concepts— ° Security and 
privacy~Information flow control * Security 
The Homomorphism Property in Query Containment and Data
Integration


Foto N. Afrati 
National Technical University of Athens 
afrati@gmail.com 


ABSTRACT 


We often add arithmetic to extend the expressiveness of 
query languages, tuple generating dependencies and data 
exchange mappings, and study the complexity of problems 
such as testing query containment and finding certain answers. 
When adding arithmetic comparisons, the complexity of such 
problems is higher than the complexity of their counterparts 
without them. It has been observed that we can achieve 
lower complexity if we restrict some of the comparisons to 
be closed or open semi-interval comparisons. Here, focusing 
on the problem of containment for conjunctive queries with 
arithmetic comparisons (CQAC queries, for short), we prove 
upper bounds on the computational complexity. 

Our main methodology uses a general property of CQACs 
and tuple generating dependencies with arithmetic compar- 
isons which is called the homomorphism property. When the 
homomorphism property holds, then the complexity of the 
above problems can be improved. We syntactically characterize
subclasses of CQAC queries that have the homomorphism
property, and we give a detailed proof that contains components
that can be used to prove more results of the same
kind. Similar methodology can be applied to achieve better 
upper bounds on the complexity of testing query containment 
under dependencies, finding query rewritings, and finding 
certain answers in data exchange. This is done by improving 
the complexity of the chase algorithm for tuple generating 
dependencies with arithmetic comparisons. 


CCS CONCEPTS 


• Information systems → Relational database model.


KEYWORDS 


query containment, query rewriting, homomorphism 
1 INTRODUCTION


Homomorphisms are central in many database problems, such
as query containment, finding rewritings for answering queries
using views, and the chase algorithm. For example, suppose
we want to check whether a conjunctive query is contained in
another conjunctive query. Then it is necessary and sufficient
to check whether there is a homomorphism from one query to
the other [6]. For conjunctive queries, the query containment
problem is NP-complete [6]. When we have constants that
are numbers (e.g., they may represent prices, dates, weights,
lengths, heights) then, often, we want to compare them by
checking, e.g., whether two numbers are equal or whether
one is greater than the other. To reason about numbers
we want to have a more expressive language than conjunctive
queries and, thus, we add arithmetic comparisons to the
definition of the query. We know that the query containment
problem for conjunctive queries with arithmetic comparisons
is $\Pi_2^p$-complete [10, 11, 19].

In previous literature [1-3, 20], it has been noticed that
there are wide classes of CQACs for which the query containment
problem remains in NP and these classes can be
syntactically characterized. In this paper, we provide a detailed
proof of results that have been presented earlier in
extended abstract form in [3]. We present new results too.

For the proof, we need to go through a very careful techni- 
cal analysis. Central to this analysis is the homomorphism 
property. Roughly (we will explain in technical terms shortly), 
for checking containment for CQACs we need to check for 
many homomorphisms and not just for one homomorphism 
and this is what usually affects the complexity. We charac- 
terize syntactically many cases where one homomorphism 
suffices to decide containment, which is what defines the 
homomorphism property. We will give an example in the next 
section to illustrate how we need multiple mappings. 

We also consider domain information and extend the re- 
sults about the homomorphism property. Domain information 
exploits the fact that, e.g., we do not compare a variable that 
represents time in minutes with a variable that represents 
weight in pounds. We can find such independent variables 
technically using a graph constructed from the arithmetic 
comparisons in the queries. Finally, we present a new result 
beyond the homomorphism property in Theorem 5.1. 

Related work. The homomorphism property for query
containment was studied in [2, 10, 21]. Recent work can be
found in [8], where the authors propose to extend graph functional
dependencies with linear arithmetic expressions and
arithmetic comparisons. They study the problems of testing
satisfiability and related problems over integers (i.e., for non-dense
orders). A thorough study of the complexity of the
problem of evaluating conjunctive queries with inequalities
($\neq$) is done in [12]. In [16] the complexity of evaluating conjunctive
queries with arithmetic comparisons is investigated
for acyclic queries, while query containment for acyclic conjunctive
queries was investigated in [7]. Recent works [5, 18]
have added arithmetic to extend the expressiveness of tuple
generating dependencies and data exchange mappings, and
studied the complexity of related problems.


2 PRELIMINARIES 


We will state our results in detail first by referring to conjunc- 
tive queries with arithmetic comparisons and query contain- 
ment. Thus we will define here these queries and later in the 
paper we will define what is a query rewriting using views. 
Then, we will extend the results about query containment 
to query rewriting and computing certain answers. We will 
discuss briefly the chase algorithm based on dependencies 
with arithmetic comparisons. 

A conjunctive query (CQ in short) is a query of the form:
$h(\bar{X}) \;{:-}\; e_1(\bar{X}_1), \ldots, e_k(\bar{X}_k)$, where $h(\bar{X})$ and $e_i(\bar{X}_i)$ are
atoms, i.e., they contain a relational symbol ($h$ and $e_i$ here)
and a vector of variables and constants. The head $h(\bar{X})$
represents the results of the query, and $e_1, \ldots, e_k$ represent
database relations (also called base relations). The part of
the conjunctive query on the right of the symbol $:-$ is called the
body of the query. Each atom in the body of a conjunctive
query is said to be a subgoal. Every argument in the subgoal
is either a variable or a constant. The variables in $\bar{X}$ are
called head or distinguished variables, while the variables
in $\bar{X}_i$ are called body variables of the query. A conjunctive
query is said to be safe if all its distinguished variables also
occur in its body. We only consider safe queries here. The
result of a CQ when applied on the base relations (i.e., when
applied on a database instance) is the set of atoms $h$ such
that there is an assignment of variables on both sides of the
query that makes all atoms in the body of the query true.

A query $Q_1$ is contained in a query $Q_2$, denoted $Q_1 \sqsubseteq Q_2$,
if for any database $D$ of the base relations, the answer
computed by $Q_1$ is a subset of the answer computed by $Q_2$,
i.e., $Q_1(D) \subseteq Q_2(D)$. The two queries are equivalent, denoted
$Q_1 \equiv Q_2$, if $Q_1 \sqsubseteq Q_2$ and $Q_2 \sqsubseteq Q_1$.

A homomorphism from a set of relational atoms to another 
set of relational atoms is a mapping of variables from one 
set to variables or constants of the other set that maps each 
variable to a single variable or constant and each constant 
to the same constant. Each atom of the former set should
map to an atom of the latter set with the same relational
symbol.

A containment mapping from a conjunctive query $Q_1$ to a
conjunctive query $Q_2$ is a homomorphism from the atoms in
the body of $Q_1$ to the atoms in the body of $Q_2$ that maps
the head of $Q_1$ to the head of $Q_2$. In this paper, when we
consider homomorphisms between queries, they are always
containment mappings, so we use the two terms interchangeably.

Chandra and Merlin [6] show that a conjunctive query $Q_1$
is contained in another conjunctive query $Q_2$ if and only if
there is a containment mapping from $Q_2$ to $Q_1$.
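
The Chandra and Merlin test can be sketched as a brute-force search. The following Python fragment is only an illustration under assumed conventions that are not part of the paper: a query is a pair (head terms, list of body atoms), an atom is a pair (relation name, tuple of terms), and variables are strings that start with an uppercase letter. The function names are hypothetical.

```python
from itertools import product

def is_var(term):
    # Assumed convention: variables are strings starting with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()

def _unify(mapping, src_terms, dst_terms):
    """Extend `mapping` so that src_terms are sent to dst_terms, if possible."""
    if len(src_terms) != len(dst_terms):
        return False
    for s, d in zip(src_terms, dst_terms):
        if is_var(s):
            # A variable must map to a single target.
            if mapping.setdefault(s, d) != d:
                return False
        elif s != d:
            # A constant must map to the same constant.
            return False
    return True

def containment_mappings(q_from, q_to):
    """Yield containment mappings from q_from to q_to (exhaustive search)."""
    (head_from, body_from), (head_to, body_to) = q_from, q_to
    # Try every way of sending each atom of q_from to an atom of q_to.
    for targets in product(body_to, repeat=len(body_from)):
        mapping = {}
        ok = _unify(mapping, head_from, head_to)
        for (rel_s, args_s), (rel_t, args_t) in zip(body_from, targets):
            ok = ok and rel_s == rel_t and _unify(mapping, args_s, args_t)
        if ok:
            yield mapping

def is_contained(q1, q2):
    """Q1 is contained in Q2 iff a containment mapping from Q2 to Q1 exists."""
    return next(containment_mappings(q2, q1), None) is not None

# h(X) :- e(X,Y), e(Y,X)   and   h(X) :- e(X,Y)
q1 = (("X",), [("e", ("X", "Y")), ("e", ("Y", "X"))])
q2 = (("X",), [("e", ("X", "Y"))])
print(is_contained(q1, q2), is_contained(q2, q1))
```

The search is exponential in the number of subgoals, which is in line with the NP-completeness of the problem; it is meant for small examples only.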

Conjunctive queries with arithmetic comparisons (CQAC
for short) are conjunctive queries that, besides the ordinary
relational subgoals, also use built-in subgoals that are arithmetic
comparisons (AC for short), i.e., of the form $X \theta Y$
where $\theta$ is one of the following: $<, >, \leq, \geq, =, \neq$. Also, $X$ is a
variable and $Y$ is either a variable or a constant. If $\theta$ is either
$<$ or $>$ we say that it is an open arithmetic comparison and
if $\theta$ is either $\leq$ or $\geq$ we say that it is a closed AC. Moreover,
the following assumptions must hold:

1) Values for the arguments in the arithmetic comparisons 
are chosen from an infinite, totally densely ordered set, such 
as the rationals or reals. 

2) The arithmetic comparisons are not contradictory; that 
is, there exists an instantiation of the variables such that all 
the arithmetic comparisons are true. 

3) All the comparisons are safe, i.e., each variable in the 
comparisons also appears in some ordinary subgoal. 

We use "CQ" to represent "conjunctive query", "AC" for
"arithmetic comparison", and "CQAC" for "conjunctive query
with arithmetic comparisons."

• The notation we use for a CQAC query $Q$ is $Q = Q_0 + \beta$,
where $Q_0$ are the relational subgoals of $Q$ and $\beta$ are
the arithmetic comparison subgoals of $Q$.


Definition 2.1. Let $Q_1$ and $Q_2$ be two conjunctive queries
with arithmetic comparisons (CQACs). We want to test
whether $Q_2 \sqsubseteq Q_1$. To do the testing, we first normalize each
of $Q_1$ and $Q_2$ to $Q_1'$ and $Q_2'$, respectively. We normalize a
CQAC query as follows:

• For each occurrence of a shared variable $X$ in a normal
(i.e., relational) subgoal, except for the first occurrence,
replace the occurrence of $X$ by a fresh variable $X_i$, and
add $X = X_i$ to the comparisons of the query; and

• For each constant $c$ in a normal subgoal, replace the
constant by a fresh variable $Z$, and add $Z = c$ to the
comparisons of the query.
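
The two normalization rules of Definition 2.1 can be sketched in Python as follows. This is a minimal illustration under assumptions that are not from the paper: atoms are pairs (relation name, tuple of terms), comparisons are triples (lhs, op, rhs), variables are strings starting with an uppercase letter, and the fresh variables are given the hypothetical names N1, N2, ..., which are assumed not to occur in the input query.

```python
from itertools import count

def is_var(term):
    # Assumed convention: variables are strings starting with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()

def normalize(body, comparisons):
    """Return (new_body, new_comparisons) for the body of a CQAC.

    After normalization no variable occurs twice in the relational
    subgoals and no constant occurs in them.
    """
    fresh = count(1)
    seen = set()
    new_body, new_acs = [], list(comparisons)
    for rel, args in body:
        new_args = []
        for t in args:
            if is_var(t) and t not in seen:
                seen.add(t)                 # first occurrence stays as it is
                new_args.append(t)
            else:
                v = f"N{next(fresh)}"       # hypothetical fresh-variable name
                new_args.append(v)
                # Repeated variable: add X = X_i; constant: add Z = c.
                new_acs.append((t, "=", v) if is_var(t) else (v, "=", t))
        new_body.append((rel, tuple(new_args)))
    return new_body, new_acs

body, acs = normalize([("a", ("X", "X", 5))], [("X", "<", 7)])
print(body)
print(acs)
```

On the single subgoal a(X, X, 5) this yields a(X, N1, N2) together with the added comparisons X = N1 and N2 = 5.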


Theorem 2.2 [9, 21] below says that $Q_2 \sqsubseteq Q_1$ iff the comparisons
in the normalized version $Q_2'$ of $Q_2$ logically imply
(denoted by "$\Rightarrow$") the disjunction of the images of the comparisons
of the normalized version $Q_1'$ of $Q_1$ under each
containment mapping from the ordinary subgoals of $Q_1'$ to
the ordinary subgoals of $Q_2'$.


THEOREM 2.2. Let $Q_1$, $Q_2$ be CQACs, and $Q_1' = Q_{10}' + \beta_1'$,
$Q_2' = Q_{20}' + \beta_2'$ be the respective queries after normalization.
Suppose there is at least one containment mapping from
$Q_{10}'$ to $Q_{20}'$. Let $\mu_1, \ldots, \mu_k$ be all the containment mappings
from $Q_{10}'$ to $Q_{20}'$. Then $Q_2 \sqsubseteq Q_1$ if and only if the following
logical implication $\phi$ is true:


$\phi : \beta_2' \Rightarrow \mu_1(\beta_1') \vee \ldots \vee \mu_k(\beta_1')$.
(We refer to $\phi$ as the containment entailment in the rest of
this paper.)


PROOF. One of the directions is straightforward: If the
containment entailment is true, then in any database that
satisfies $\beta_2'$, one of the $\mu_i(\beta_1')$ will be satisfied (because we
deal with constants), and hence containment is proven.

For the "only-if" direction, suppose $Q_2$ is contained in $Q_1$,
but the containment entailment is false. We assign constants
to the variables that make this implication false. Then for
all the containment mappings $\mu_i$ (for each of which $\mu_i(\beta_1')$
does not hold), the query containment is false, because we
have found a counterexample database $D$. Database $D$ is
constructed by assigning the corresponding constants to the
ordinary subgoals of $Q_2$. On this counterexample database
$D$, $Q_2$ produces a tuple, but there is no $\mu_i$ that will make
$Q_1$ produce the same tuple (because all $\mu_i(\beta_1')$ fail). We
need to remember that, using the $\mu_i$'s, we can produce all
homomorphisms from $Q_1$ to any database where the relational
atoms of $Q_2$ map via a homomorphism. This is because
the $\mu_i$'s were produced using the normalized version of the
queries and, hence, the $\mu_i$'s were not constrained by duplication
of variables or by constants (recall that, in a homomorphism,
a variable is allowed to map to a single target and a constant
is allowed to map only to the same constant).


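Example 2.3. We will apply Theorem 2.2 to prove that $Q_1$
contains $Q_2$.


$Q_1 \;{:-}\; a(X_1, Y_1, Z_1),\ X_1 = Y_1,\ Z_1 < 5$
$Q_2 \;{:-}\; a(X, Y, Z'),\ a(X', Y', Z),\ X \leq 5,\ Y \leq X,\ Z \leq Y,\ X' = Y',\ Z' < 5$

We have two containment mappings:

$\mu_1 : X_1 \to X,\ Y_1 \to Y,\ Z_1 \to Z'$

$\mu_2 : X_1 \to X',\ Y_1 \to Y',\ Z_1 \to Z$

Hence, $\mu_1(X_1) = X$, $\mu_1(Y_1) = Y$, $\mu_1(Z_1) = Z'$ and
$\mu_2(X_1) = X'$, $\mu_2(Y_1) = Y'$, $\mu_2(Z_1) = Z$.

The containment entailment is:


$X \leq 5 \wedge Y \leq X \wedge Z \leq Y \wedge X' = Y' \wedge Z' < 5 \Rightarrow$
$(\mu_1(X_1) = \mu_1(Y_1) \wedge \mu_1(Z_1) < 5) \vee$
$(\mu_2(X_1) = \mu_2(Y_1) \wedge \mu_2(Z_1) < 5)$

which is written equivalently:

$X \leq 5 \wedge Y \leq X \wedge Z \leq Y \wedge X' = Y' \wedge Z' < 5 \Rightarrow$
$(X = Y \wedge Z' < 5) \vee (X' = Y' \wedge Z < 5)$

The above logical implication is true: the left hand side gives
$Z \leq Y \leq X \leq 5$, so either $X = Y$ and the first disjunct holds,
or $Y < X$, hence $Z < 5$, and the second disjunct holds.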
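
The containment entailment of Example 2.3 can be sanity-checked numerically. The Python sketch below enumerates assignments over a small grid of values around the constant 5; the grid and the helper names are assumptions made for illustration, and a finite enumeration is a plausibility check, not a proof over a dense order. It also checks that neither disjunct alone is implied, which is why both containment mappings are needed here.

```python
from itertools import product

def lhs(X, Y, Z, Xp, Yp, Zp):
    # Comparisons of Q2; Xp, Yp, Zp stand for X', Y', Z'.
    return X <= 5 and Y <= X and Z <= Y and Xp == Yp and Zp < 5

def d1(X, Y, Z, Xp, Yp, Zp):
    # Image of the comparisons of Q1 under mu_1.
    return X == Y and Zp < 5

def d2(X, Y, Z, Xp, Yp, Zp):
    # Image of the comparisons of Q1 under mu_2.
    return Xp == Yp and Z < 5

GRID = [3, 4, 5, 6]  # sample values below, at, and above the constant 5
points = [p for p in product(GRID, repeat=6) if lhs(*p)]

entailment_holds = all(d1(*p) or d2(*p) for p in points)
mu1_alone_suffices = all(d1(*p) for p in points)
mu2_alone_suffices = all(d2(*p) for p in points)
print(entailment_holds, mu1_alone_suffices, mu2_alone_suffices)
# prints: True False False
```

For instance, the assignment X = 5, Y = 4 falsifies the first disjunct, while X = Y = Z = 5 falsifies the second.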


2.1 Homomorphism Property (HP for short)


The homomorphism property defined below will be used to
prove that for certain classes of CQAC queries, the query
containment problem is in NP.


Definition 2.4. (Homomorphism property.) Let $\mathcal{Q}_1$ and
$\mathcal{Q}_2$ be two classes of CQAC queries. We say that the homomorphism
property holds from $\mathcal{Q}_1$ to $\mathcal{Q}_2$ if for any pair of
normalized queries $Q_1 \in \mathcal{Q}_1$ and $Q_2 \in \mathcal{Q}_2$ the following two
statements are equivalent:

1. $Q_2$ is contained in $Q_1$.

2. There is a homomorphism $\mu$ from $Q_{10}$ to $Q_{20}$, such that
the following is true:


$\beta_2 \Rightarrow \mu(\beta_1)$.


A trivial example where the homomorphism property holds
is when both classes are equal to the class of CQs, i.e., conjunctive
queries without arithmetic comparisons. Another
trivial example is when $\mathcal{Q}_1$ is equal to the class of CQs and
$\mathcal{Q}_2$ is equal to the class of CQACs.

The following theorem puts the query containment problem
in NP when the homomorphism property holds.


THEOREM 2.5. Let $\mathcal{Q}_1$ and $\mathcal{Q}_2$ be two classes of CQAC
queries such that the homomorphism property holds from $\mathcal{Q}_1$
to $\mathcal{Q}_2$. Then checking containment of a query $Q_2 \in \mathcal{Q}_2$ in a
query $Q_1 \in \mathcal{Q}_1$ can be done in nondeterministic polynomial
time.


PROOF. The witness is a mapping $\mu$ from $Q_{10}$ to $Q_{20}$. We
need to check that $\mu$ is a homomorphism and that


$\beta_2 \Rightarrow \mu(\beta_1)$.


It is easy to see how to check the former in polynomial time.
For the latter we check whether $\neg(\beta_2 \wedge \neg\mu(\beta_1))$ is true. To
prove that this can be done in polynomial time we use the
algorithm in Subsection 2.2. The algorithm finds the strongly
connected components of a directed graph (this can be done
in polynomial time for any directed graph) and argues on
them.


We will prove in the rest of the paper that the homomorphism
property holds from $\mathcal{Q}_1$ to $\mathcal{Q}_2$ if we restrict $\mathcal{Q}_1$ to
classes of conjunctive queries with comparisons which compare
a variable to a constant. In particular, we define different
classes of such comparisons in the following.

We use var and const to denote any variable or any constant.
We define:


• Semi-interval (SI for short) arithmetic comparisons are
the comparisons that compare a variable to a constant
and do not use $\neq$, e.g., $X < 6$, $Y > 8$ are all SIs, while
$X \neq 5$ is not an SI.

• Left semi-interval (LSI for short) arithmetic comparisons
are SIs of the form var $\leq$ const or var $<$ const,
where var is a variable and const is a constant. Symmetrically,
we define right semi-interval (RSI for short)
arithmetic comparisons to be of the form var $\geq$ const
or var $>$ const. Thus, e.g., $X < 5$ is an LSI and $X > 5$
is an RSI.

• A point inequality (PI for short) arithmetic comparison uses $\neq$ and
compares a variable to a constant, e.g., $X \neq 6$, $Y \neq 8$
are all PIs, while $X \neq Y$ is not a PI because it uses two
variables (both $X$ and $Y$ are variables).
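
The three classes above can be sketched as a small classifier. The Python fragment below is illustrative only: the triple encoding (lhs, op, rhs), the convention that variables are strings and constants are numbers, and the function name are assumptions. Treating var = const as an SI follows a literal reading of the definition (a variable compared to a constant without $\neq$) and is also an assumption.

```python
def is_var(term):
    # Assumed convention: variables are strings, constants are numbers.
    return isinstance(term, str)

def classify(ac):
    """Return the labels among {"SI", "LSI", "RSI", "PI"} that apply to one AC."""
    lhs, op, rhs = ac
    labels = set()
    if not (is_var(lhs) and not is_var(rhs)):
        return labels                    # not of the form "variable op constant"
    if op == "!=":
        labels.add("PI")
    elif op in ("<", "<="):
        labels.update({"SI", "LSI"})
    elif op in (">", ">="):
        labels.update({"SI", "RSI"})
    elif op == "=":
        labels.add("SI")                 # literal reading of the SI definition
    return labels

for ac in [("X", "<", 6), ("Y", ">", 8), ("X", "!=", 5), ("X", "!=", "Y")]:
    print(ac, sorted(classify(ac)))
```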
2.2 The algorithm to check satisfaction of
a collection of ACs


We will present algorithm AC-sat which, given a collection
of ACs as input, checks whether there is a satisfying assignment,
i.e., an assignment of real numbers to the variables that
makes all ACs in the collection true. If there is none, then we
say that the conjunction of ACs is false, or that the collection
of ACs is contradictory, or that the collection of ACs is not
consistent.

We define the induced directed graph of a collection of
ACs. The induced directed graph has nodes that are variables
or constants. There is an edge labeled $\leq$ from node
$n_1$ to node $n_2$ if there is an AC in the collection which is
$n_1 \leq n_2$. There is an edge labeled $<$ from node
$n_1$ to node $n_2$ if there is an AC in the collection which is $n_1 < n_2$.
(We only label edges $<$ or $\leq$ since the other direction, $>$
or $\geq$, is indicated by the direction of the edge.) We treat
each equation $X = Y$ as two ACs of the form $X \leq Y$ and
$X \geq Y$ and we add edges accordingly. Finally we add "$<$"
edges between all pairs of constants depending on their order.


Algorithm AC-sat. We consider the induced directed
graph of the collection of ACs and we find all strongly connected
components of it. We say that an edge belongs to a
strongly connected component if it joins two nodes in this
strongly connected component.

The collection of ACs is contradictory if any of the
following cases is true.

Case 1. There is a strongly connected component with two
distinct constants belonging to it.

Case 2. There is a strongly connected component with an
edge labeled $<$.

Case 3. There is an AC $A_1 \neq A_2$ such that $A_1$ and $A_2$
belong to the same strongly connected component and this
component has only $\leq$ edges on it.
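
A minimal Python sketch of the AC-sat test follows. It is an illustration under stated assumptions, not the paper's implementation: ACs are triples (lhs, op, rhs), variables are strings, constants are numbers, and Case 1 is read as concerning two distinct constants in one component, which is what the completeness argument below relies on. For brevity, strongly connected components are found by mutual reachability rather than by a linear-time algorithm; this is still polynomial.

```python
def is_const(node):
    # Assumed convention: constants are numbers, variables are strings.
    return isinstance(node, (int, float))

def ac_sat(acs):
    """Return True iff the collection of ACs has a satisfying assignment."""
    edges, neqs, nodes = [], [], set()   # edge = (source, target, is_strict)
    for lhs, op, rhs in acs:
        nodes.update((lhs, rhs))
        if op == "<":
            edges.append((lhs, rhs, True))
        elif op == "<=":
            edges.append((lhs, rhs, False))
        elif op == ">":
            edges.append((rhs, lhs, True))
        elif op == ">=":
            edges.append((rhs, lhs, False))
        elif op == "=":
            edges += [(lhs, rhs, False), (rhs, lhs, False)]
        elif op == "!=":
            neqs.append((lhs, rhs))
        else:
            raise ValueError(f"unknown operator {op!r}")

    # "<" edges between constants; consecutive pairs give the same reachability.
    consts = sorted(n for n in nodes if is_const(n))
    edges += [(a, b, True) for a, b in zip(consts, consts[1:])]

    succ = {n: set() for n in nodes}
    for u, v, _ in edges:
        succ[u].add(v)

    def reachable(start):
        seen, stack = {start}, [start]
        while stack:
            for nxt in succ[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reach = {n: reachable(n) for n in nodes}

    def same_scc(a, b):
        return b in reach[a] and a in reach[b]

    # Cases 1 and 2: a strict edge inside a component. Two distinct constants
    # in one component always put a strict constant-to-constant edge in it.
    if any(strict and same_scc(u, v) for u, v, strict in edges):
        return False
    # Case 3: both sides of a "!=" AC are forced to be equal.
    return not any(same_scc(a, b) for a, b in neqs)

print(ac_sat([("X", "<=", "Y"), ("Y", "<=", "X")]))             # satisfiable
print(ac_sat([("X", "<=", 5), ("X", ">=", 5), ("X", "!=", 5)])) # contradictory
```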


LEMMA 2.6. The algorithm AC-sat is a complete and 
sound procedure to check that a conjunction of ACs is con- 
tradictory. 


PROOF. First we prove that this procedure is complete. I.e.,
we prove that if the procedure shows that the conjunction
is not false then we can consistently assign constants to
variables to make all ACs true.

Since neither Case 1 nor Case 2 happens, all strongly connected
components have $\leq$ labels and at most one constant.
Thus, we assign to each of the elements of a strongly connected
component the constant which is either a new constant
or the constant of the component as follows: We collapse each
strongly connected component to one node and the induced
directed graph is reduced to an acyclic directed graph. We
consider a topological sorting of this acyclic graph into a
number of levels. We assign constants (different constants to
different variables and constants that are different from the
constants already present in the collection of ACs) following
this topological sorting, so that constants in the next level
are greater than the constants in the previous levels. This
makes all ACs true.

Now we prove that this procedure is sound. Whenever the
procedure stops in Cases 1 and 2 then there is no assignment
that satisfies all ACs in this strongly connected component
because there is a cycle with either two distinct constants on
it or with an edge labeled $<$. Whenever the procedure stops
in Case 3, then $A_1$ and $A_2$ should be equal according to the
strongly connected component they belong to. Thus we cannot
find an assignment that also satisfies the AC $A_1 \neq A_2$.


A byproduct of the above proof which will be useful later 
is the following lemma. 


LEMMA 2.7. A conjunction of ACs is false iff there is one
$\neq$ AC, $a_i$, such that the conjunction of ACs that is created
after dropping all $\neq$ ACs and keeping only $a_i$ is also false.


2.3 When normalization is not necessary 


When we use only closed ACs, then normalization is not 
necessary: 


THEOREM 2.8. Consider two CQAC queries, $Q_1 = Q_{10} + \beta_1$
and $Q_2 = Q_{20} + \beta_2$, over densely totally ordered domains.
Suppose $\beta_1$ contains only $\leq$ and $\geq$, and each of $\beta_1$ and $\beta_2$
does not imply any "=" restrictions. Then $Q_2 \sqsubseteq Q_1$ if and
only if


$\phi : \beta_2 \Rightarrow \gamma_1(\beta_1) \vee \ldots \vee \gamma_l(\beta_1)$,


where $\gamma_1, \ldots, \gamma_l$ are all the containment mappings from $Q_{10}$
to $Q_{20}$.

PROOF. One of the directions is straightforward: If the
containment entailment is true, then in any database that
satisfies $\beta_2$, one of the $\gamma_i(\beta_1)$ will be satisfied (because we
deal with constants), and hence containment is proven.

Now we prove the other direction: Suppose $Q_2$ is contained
in $Q_1$, and the implication in the statement of the theorem
is false. Then there is an assignment $\sigma$ of values that are
constants from a densely ordered domain to the variables
that satisfies the left-hand side, $\sigma(\beta_2)$, but not the right-hand
side of the containment entailment. This assignment $\sigma$ can
create a counterexample database. The critical observation is
the following: Suppose, in this assignment $\sigma$, there are either
two or more variables that are equal to the same constant
not in the query or there is at least one variable that is equal
to a constant appearing in the queries. Then there is another
assignment $\sigma'$ where this does not happen (i.e., all variables
are assigned in $\sigma'$ to distinct constants) and such that $\sigma'(\beta_2)$
is true. We will use $\sigma'$ to create a counterexample database
isomorphic to the relational subgoals of $Q_2$.

We create $\sigma'$ from $\sigma$ as follows: Suppose $N$ is the number
of variables in the queries. We choose a small value $\epsilon$ so that
$N\epsilon$ is much smaller than the distance between any two distinct
constants among those used in $\sigma$ and those appearing in the
queries. Since $\beta_2$
is consistent and it does not imply "=", the variables that
have the same value $c_0$ in $\sigma$ form an acyclic graph on the
induced directed graph of the ACs of $\beta_2$. We can choose any
total order that is deduced from this acyclic graph and assign
values/constants according to this order. In particular we
assign distinct constants and each new constant is within a
distance $N\epsilon$ from the original constant, $c_0$. E.g., if we have
only two variables $X$ and $Y$ with value $c_0$ in $\sigma$, and we have
in $\beta_2$ the AC $X \leq Y$, then in $\sigma'$ we have $X = c_0 - \epsilon$ and
$Y = c_0 + \epsilon$.

Claim. If a conjunction $\eta$ of closed ACs that use only
constants appearing in $Q_1$ and $Q_2$ becomes true with the
assignment $\sigma'$ then it becomes true with the assignment $\sigma$
too. I.e., if $\sigma'(\eta)$ is true then $\sigma(\eta)$ is true too.

Proof of the Claim. Observe that the relation between $\sigma$
and $\sigma'$ is the following by construction: The variables in $\eta$
can be partitioned into pairwise disjoint subsets, and for each
subset the following holds: a) the values of the variables in $\sigma$
are equal to each other, b) the values of the variables in $\sigma'$ are
all distinct but within a radius of $N\epsilon$ of each other, and c) the
values of the variables in $\sigma'$ are at distance much greater
than $N\epsilon$ from constants appearing in $\eta$, except the constant
in $\eta$ that "belongs" to the particular subset. Hence, in $\sigma'(\eta)$,
any AC var $\leq$ const (where const is a constant appearing
in $\eta$ and var is a variable) has the same truth value as in
$\sigma(\eta)$ for all variables except when they are compared to the
constant that belongs to their subset. This concludes the
proof of the Claim.

Now, we turn the relational subgoals of $Q_2$ into a database
$D'$ by assigning the values in $\sigma'$ to variables and, for variables
that do not appear in $\sigma'$, we assign arbitrary distinct values
much greater than any constant appearing in the queries.
Since $Q_2$ is contained in $Q_1$, there must be a homomorphism
$h_1$ from the relational subgoals of $Q_1$ to $D'$ such that the
ACs in $Q_1$ are satisfied, i.e., such that $\sigma'(h_1(\beta_1))$ is true. By
construction, $D'$ is isomorphic to the relational subgoals of
$Q_2$. Hence $h_1$ has an "isomorphic" homomorphism $\gamma_m$ among
the $\gamma_i$'s for which $\sigma'(\gamma_m(\beta_1))$ is true.

Now we apply the Claim for $\eta = \gamma_m(\beta_1)$ and deduce that
the assignment $\sigma$ makes $\gamma_m(\beta_1)$ true. Hence, we arrive
at a contradiction.


3 ANALYSING THE CONTAINMENT 
ENTAILMENT: CONTAINMENT 
IMPLICATIONS 


In this section, we develop tools for the proofs we provide
later. Consider the containment entailment in Theorem 2.2;
we have dropped the primed versions of $\beta_1$ and $\beta_2$ but this
is what we are referring to.


$\beta_2 \Rightarrow \mu_1(\beta_1) \vee \ldots \vee \mu_k(\beta_1)$.


The right hand side of the containment entailment is a disjunction
of disjuncts, where each disjunct is a conjunction
of ACs. We can turn this, equivalently, into a conjunction of
conjuncts, where each conjunct is a disjunction of ACs. We
call each of these last conjuncts a rhs-conjunct (from right
hand side conjunct). Now we can turn the containment entailment,
equivalently, into a number of implications. In each
implication, we keep the left hand side of the containment
entailment the same and have the right hand side be one of
the rhs-conjuncts. We call each such implication a containment
implication. Since each logical implication $a \Rightarrow b$ can be
turned, equivalently, into a disjunction $\neg a \vee b$, we turn each
containment implication into a disjunction of the form


$\neg\beta_2 \vee rhsc_1 \vee rhsc_2 \vee \ldots$


where $rhsc_1, rhsc_2, \ldots$ are the arithmetic comparisons from
the particular rhs-conjunct. Finally, we turn each such disjunction
into the negation of an expression $E$ of the form


$E = \beta_2 \wedge \neg rhsc_1 \wedge \neg rhsc_2 \wedge \ldots$ (1)


It is easy to see that $E_r = \neg rhsc_1 \wedge \neg rhsc_2 \wedge \ldots$ has been created
from the particular rhs-conjunct after we have negated
each of the ACs of this rhs-conjunct. Thus, we call $E_r$ a
reverse rhs-conjunct; we may also refer to the ACs of this
without using "reverse" and we assume it is understood. We
illustrate on an example.


Example 3.1. We continue from Example 2.3. We repeat
the queries considered:


$Q_1 \;{:-}\; a(X_1, Y_1, Z_1),\ X_1 = Y_1,\ Z_1 < 5$
$Q_2 \;{:-}\; a(X, Y, Z'),\ a(X', Y', Z),\ X \leq 5,\ Y \leq X,\ Z \leq Y,\ X' = Y',\ Z' < 5$

Now we consider the containment entailment we built in
Example 2.3. According to what we analyzed in this section,
we can rewrite this containment entailment equivalently by
transforming its right hand side into a conjunction, where
each conjunct is a disjunction of ACs.

$X \leq 5 \wedge Y \leq X \wedge Z \leq Y \wedge X' = Y' \wedge Z' < 5 \Rightarrow (X = Y \vee X' = Y')$
$\wedge (X = Y \vee Z < 5) \wedge (Z' < 5 \vee X' = Y') \wedge (Z' < 5 \vee Z < 5)$

The above implication has 4 rhs-conjuncts, e.g.,
$(X = Y \vee X' = Y')$ is one rhs-conjunct.

Now, we have 4 containment implications (one for each rhs-conjunct).
We only list one of the containment implications:

$X \leq 5 \wedge Y \leq X \wedge Z \leq Y \wedge X' = Y' \wedge Z' < 5 \Rightarrow$
$(X = Y \vee X' = Y')$.

The above containment implication can be written equivalently
as:

$\neg(X \leq 5 \wedge Y \leq X \wedge Z \leq Y \wedge X' = Y' \wedge Z' < 5) \vee$
$(X = Y \vee X' = Y')$,

which can be written equivalently as:

$\neg(X \leq 5 \wedge Y \leq X \wedge Z \leq Y \wedge X' = Y' \wedge Z' < 5 \wedge$
$\neg(X = Y) \wedge \neg(X' = Y'))$.

Now we can replace $\neg(X = Y)$ with $X \neq Y$ and $\neg(X' = Y')$
with $X' \neq Y'$ and get, equivalently:

$\neg E =$
$\neg(X \leq 5 \wedge Y \leq X \wedge Z \leq Y \wedge X' = Y' \wedge Z' < 5 \wedge$
$X \neq Y \wedge X' \neq Y')$.

E.g., $(X \neq Y \wedge X' \neq Y')$ is one of the 4 reverse rhs-conjuncts.

Now, we are ready to prove our first lemma:
LEMMA 3.2. Consider the containment entailment


$\beta_2 \Rightarrow \mu_1(\beta_1) \vee \ldots \vee \mu_k(\beta_1)$.


The containment entailment is equivalent to one with only one
disjunct on the right hand side if and only if each containment
implication has one disjunct $d_i$ on the right hand side such
that

$\beta_2 \Rightarrow d_i$.


PROOF. The one direction is obvious. For the other direction,
we argue as follows. Towards contradiction, suppose
each disjunct in the containment entailment has at least one
AC, say $a_i$, such that the implication


$\beta_2 \Rightarrow a_i$


is not true. Then take the disjunction of all these $a_i$'s and
consider the corresponding containment implication that is
formed by having this disjunction on its right hand side; call
this containment implication $c_i$. The premise of the lemma
says that, for each containment implication, we have at least
one AC, $d_i$, on its rhs such that


$\beta_2 \Rightarrow d_i$.


However, we assumed that in $c_i$ none of the ACs on the right
hand side is implied by $\beta_2$, hence
contradiction.


THEOREM 3.3. If the HP holds for the containment entailment
then, for each containment implication, there is an
AC, $a_i$, on the rhs such that $\beta_2 \Rightarrow a_i$.


The proof of the above theorem is a direct consequence of 
Lemma 3.2. It should be clear now, that we will use the above 
lemma in the following way: When we want to argue on the 
homomorphism property, we will focus on the containment 
implications rather than on the containment entailment. 

Convention: 


• We will refer to the expression $E$ that we created
above often. Remember $E$ comes from a containment
implication and this implication is true iff $\neg E$ is true.

• $E$ is a conjunction of ACs. It contains ACs that come
from the left hand side of the implication (and we will
refer to them as lhs ACs) and ACs that come from the
right hand side of the implication (and we will refer to
them as rhs ACs).


The lemmas in the Appendix are used to set conditions 
for the containment implications to be true in special cases 
when some arithmetic comparisons are restricted to semi- 
interval comparisons. The intuition in their proof is that 
the rhs ACs are the ones that will introduce inconsistencies 
and it is critical to know how many rhs ACs are needed to 
introduce these inconsistencies. The HP holds only if one of 
them suffices to introduce inconsistencies. 


4 CLASSES OF CQACS WHERE HP 
HOLDS (SI ACS) 


In order to state the results clearly, we use the following
definition.

Definition 4.1. An AC-type is one of the elements of the
following set $T_{AC}$:

$T_{AC} = \{var \leq var,\ var < var,\ var \leq const,\ var < const,\ const \leq var,$
$const < var,\ var = var,\ var = const,\ var \neq var,\ var \neq const\}$

An AC $X < Y$ is of type $var < var$ if both $X$ and $Y$ are
variables. If $X$ is a variable and $Y$ is a constant then it is
of type $var < const$. The rest are defined similarly in the
obvious way.


For example, a closed LSI AC is of type $var \leq const$.

An AC-family is defined by a subset of $T_{AC}$. An AC belongs
to a specific AC-family if its type belongs to the
family.
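To make Definition 4.1 concrete, the following is a minimal sketch of classifying an AC by its type and testing membership in an AC-family. The tuple encoding of an AC, the ASCII operator names, and the helper names `ac_type` and `in_family` are assumptions made only for this illustration; they do not come from the paper.

```python
from numbers import Number

# The ten AC-types of Definition 4.1, written with ASCII operators
# (an encoding chosen for this sketch).
T_AC = {
    "var <= var", "var < var", "var <= const", "var < const",
    "const <= var", "const < var", "var = var", "var = const",
    "var != var", "var != const",
}


def ac_type(left, op, right):
    """Return the AC-type of the comparison `left op right`.

    Python numbers are treated as constants, strings as variables.
    """
    def kind(term):
        return "const" if isinstance(term, Number) else "var"

    # = and != are symmetric, so a constant on the left is moved to the right.
    if op in ("=", "!=") and kind(left) == "const" and kind(right) == "var":
        left, right = right, left
    return f"{kind(left)} {op} {kind(right)}"


def in_family(ac, family):
    """An AC belongs to an AC-family iff its type is an element of the family."""
    return ac_type(*ac) in family


# Hypothetical family: LSI comparisons plus equalities.
lsi_family = {"var <= const", "var < const", "var = var", "var = const"}

closed_lsi = ("X", "<=", 5)
print(ac_type(*closed_lsi))                   # type of a closed LSI
print(in_family(closed_lsi, lsi_family))      # membership in the family
print(in_family((5, "<", "Y"), lsi_family))   # an open RSI is not in it
```

The family is just a set of type names, so the families used in the theorems of this section can be written down directly as subsets of `T_AC`.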

We present the main results about the homomorphism property
in three theorems; these results first appeared in [3]. The
proofs of all three theorems below are direct consequences of
Lemma 3.2 and Lemmas A.1 (for Theorems 4.2 and 4.3) and
A.2 (for Theorem 4.5). The theorems below do not use all
the potential of the lemmas in the appendix; however, it is 
easy to state more similar theorems if we look closer into 
the proofs of these lemmas. In [3], there are more refined 
statements of similar results; in the present paper we focus 
on the proof technique. 


THEOREM 4.2. Suppose $\mathcal{Q}_1$ is the class of CQAC queries
whose normalized version uses ACs from $\{var \leq const,\ var < const,\ var = var,\ var = const\}$
and $\mathcal{Q}_2$ is the class of CQAC
queries whose normalized version uses ACs from $T_{AC} - \{var \leq const\}$.
Then, the homomorphism property holds from $\mathcal{Q}_1$ to
$\mathcal{Q}_2$.


THEOREM 4.3. Suppose $\mathcal{Q}_1$ is the class of CQAC queries
whose normalized version uses ACs from $\{var \leq const,\ var < const,\ var = var,\ var = const\}$
and $\mathcal{Q}_2$ is the class of CQAC
queries whose normalized version uses ACs from $T_{AC}$. Then,
the homomorphism property holds from $\mathcal{Q}_1$ to $\mathcal{Q}_2$ under the
condition that the considered queries $Q_1 \in \mathcal{Q}_1$ and $Q_2 \in \mathcal{Q}_2$
are such that they have the following property: There is no
constant shared by a) an open LSI of $Q_1$, and b) a closed LSI
of $Q_2$.

We give an example to show that the conditions in the
theorem are tight, in that we only use LSIs in both queries.
Specifically, $Q_1$ uses two ACs, one $var = const$ and the other
$var < const$.

Example 4.4.

$Q_1$ :- $a(X, Y),\ X = 5,\ Y < 5$
$Q_2$ :- $a(X, Y),\ a(Y, Z),\ X = 5,\ 5 \geq Y,\ Z < 5$

or an arbitrarily long query:

$Q'_2$ :- $a(X, Y_1),\ a(Y_1, Y_2),\ a(Y_2, Y_3),\ a(Y_3, Y_4),\ \dots,\ a(Y_n, Z),$
$X = 5,\ 5 \geq Y_1,\ 5 \geq Y_2,\ 5 \geq Y_3,\ \dots,\ 5 \geq Y_n,\ Z < 5$

Both $Q_2$ and $Q'_2$ are contained in $Q_1$ and they both need
more than one mapping to prove containment, hence the
homomorphism property does not hold. For $Q_2$, the mapping onto
$a(X, Y)$ works when $Y < 5$ and the mapping onto $a(Y, Z)$ works
when $Y = 5$, but neither works in both cases. Moreover, $Q'_2$
demonstrates that even if $Q_1$ has only one relational subgoal and
two AC subgoals, there is a query $Q'_2$ such that, in order
to test containment of $Q'_2$ in $Q_1$, we need arbitrarily many
mappings. I.e., for $Q'_2$ there are no $n - 1$ mappings that
suffice to prove that the containment entailment is true.
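The claim that two mappings are needed can be sanity-checked numerically. The sketch below assumes the example is read as $Q_1$ :- $a(X,Y), X = 5, Y < 5$ and $Q_2$ :- $a(X,Y), a(Y,Z), X = 5, 5 \geq Y, Z < 5$, and it only samples a few values around the constant 5, so it illustrates the argument rather than proving it. The function names are hypothetical.

```python
from itertools import product

GRID = [3, 4, 5, 6]  # a few sample values around the constant 5


def beta2(x, y, z):
    """ACs of Q2 under the assumed reading: X = 5, 5 >= Y, Z < 5."""
    return x == 5 and y <= 5 and z < 5


def mu1(x, y, z):
    """Image of Q1's ACs when a(X,Y) is mapped onto a(X,Y): X = 5 and Y < 5."""
    return x == 5 and y < 5


def mu2(x, y, z):
    """Image of Q1's ACs when a(X,Y) is mapped onto a(Y,Z): Y = 5 and Z < 5."""
    return y == 5 and z < 5


# All sampled assignments that satisfy the left hand side of the entailment.
models = [v for v in product(GRID, repeat=3) if beta2(*v)]

both = all(mu1(*v) or mu2(*v) for v in models)   # the disjunction of both mappings
only_mu1 = all(mu1(*v) for v in models)          # first mapping alone
only_mu2 = all(mu2(*v) for v in models)          # second mapping alone
print(both, only_mu1, only_mu2)
```

On the sampled grid the disjunction of the two mappings is always satisfied, while each mapping alone fails for some assignment (the first when $Y = 5$, the second when $Y < 5$).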


The following theorem considers containing queries with
closed SI ACs.


THEOREM 4.5. Suppose $\mathcal{Q}_1$ is the class of CQAC queries
whose normalized version uses ACs from $\{var \leq const,\ var \geq const,\ var = var,\ var = const\}$
and $\mathcal{Q}_2$ is the class of CQAC
queries whose normalized version uses ACs from
$T_{AC} - \{var > var,\ var > const,\ const > var\}$. Then, the homomorphism
property holds from $\mathcal{Q}_1$ to $\mathcal{Q}_2$ under the condition that
the considered query $Q_1 \in \mathcal{Q}_1$ has the property: There is
a constant $c$ such that all constants in RSI ACs of $Q_1$ are
greater than $c$ and all constants in LSI ACs of $Q_1$ are less
than $c$.
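The side condition of Theorem 4.5 is easy to test mechanically. A minimal sketch follows, assuming the constants of the LSI and RSI comparisons of $Q_1$ have already been collected into two lists; the function name is hypothetical.

```python
def separating_constant_exists(lsi_constants, rsi_constants):
    """True iff some c satisfies: every LSI constant < c < every RSI constant.

    Over a dense order such as the reals, such a c exists exactly when the
    largest LSI constant is strictly smaller than the smallest RSI constant.
    """
    if not lsi_constants or not rsi_constants:
        # One side is empty, so a sufficiently small or large c always works.
        return True
    return max(lsi_constants) < min(rsi_constants)


# X <= 3, Y >= 7: any c strictly between 3 and 7 separates the constants.
print(separating_constant_exists([3], [7]))
# X <= 7, Y >= 3: no c is above 7 and below 3.
print(separating_constant_exists([7], [3]))
# X <= 5, Y >= 5: the inequalities on c are strict, so no c exists.
print(separating_constant_exists([5], [5]))
```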


5 BEYOND THE HOMOMORPHISM 
PROPERTY 


There are other subclasses of CQAC for which the homo- 
morphism property may not hold but the query containment 
problem is in NP. Theorem 5.1 is such a case; it extends 
significantly a result that was presented in [2]. 


5.1 A case in NP 


THEOREM 5.1. Consider two conjunctive queries with arithmetic
comparisons, $Q_1$ and $Q_2$, with ACs restricted as follows:
All ACs are closed. The query $Q_2$ has any AC and the query
$Q_1$ has one closed RSI and at least one closed LSI. Then
testing containment of $Q_2$ in $Q_1$ is NP-complete.


PROOF. Notice that we do not need normalization in this 
case, according to Theorem 2.8. First we prove the claim: 

Claim: If one closed RSI and any number of closed LSIs
are used in the query $Q_1$, then there is one disjunct in the
containment entailment such that all ACs in this disjunct
except one are directly implied by the ACs in $Q_2$.

Proof of the Claim: Suppose all disjuncts have more than
one AC that is not directly implied. Since we have only one
RSI in each disjunct, there is in each disjunct at least one LSI
that is not directly implied. Thus, we can build a containment
implication with all these LSIs on the rhs. This containment
implication is not true because the algorithm AC-sat only
uses one LSI from the rhs to prove that the containment
implication is true (see Lemma A.1). Using one rhs AC to prove the
implication is equivalent to stating that there is an AC from
the rhs which is directly implied by $\beta_2$; this is a contradiction.

Now, we will prove that we need a polynomial number 
of mappings. We again rewrite the containment entailment 
equivalently in different ways in order to argue appropriately. 
In particular, we use the following observation: 

Observation: We can rewrite equivalently the implication
$a \Rightarrow b \lor c$ as $a \land \neg b \Rightarrow c$.
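The observation is a propositional equivalence, so it can be confirmed by enumerating truth values. The sketch below also checks the way the observation is used in the rest of the proof: when the left hand side already implies $e_2$, the entailment with disjunct $(e_1 \land e_2)$ is equivalent to moving $\neg e_1$ to the left. The variable names are chosen for this illustration only.

```python
from itertools import product


def implies(p, q):
    """Material implication p => q."""
    return (not p) or q


# Observation: (a => b or c) is equivalent to (a and not b => c).
observation_holds = all(
    implies(a, b or c) == implies(a and not b, c)
    for a, b, c in product([False, True], repeat=3)
)

# Use in the proof: whenever beta => e2 holds,
# (beta => D or (e1 and e2)) is equivalent to (beta and not e1 => D).
reduction_holds = all(
    implies(beta, d or (e1 and e2)) == implies(beta and not e1, d)
    for beta, d, e1, e2 in product([False, True], repeat=4)
    if implies(beta, e2)
)

print(observation_holds, reduction_holds)
```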


IDEAS'19, June 10-12, 2019, Athens, Greece 


Thus, we consider the containment entailment with $k + 1$
disjuncts on the right hand side written as:

$$\beta_2 \Rightarrow D_1 \lor D_2 \lor \dots \lor D_k \lor (e_1 \land e_2 \land \dots)$$

where we have written the first $k$ disjuncts as $D_1, D_2, \dots, D_k$
and the disjunct $D_{k+1}$ that has the property that is specified
by the Claim above is written more analytically as
$(e_1 \land e_2 \land \dots)$. We assume that all $e_i$, $i = 2, \dots$ are directly implied by
$\beta_2$ (i.e., $\beta_2 \Rightarrow e_i$, $i = 2, \dots$) but $e_1$ is not. Now we rewrite
equivalently the containment entailment as:

$$\beta_2 \Rightarrow (D_1 \lor D_2 \lor \dots \lor D_k \lor e_1) \land (D_1 \lor D_2 \lor \dots \lor D_k \lor e_2) \land \dots$$

Since we have $\beta_2 \Rightarrow e_i$, $i = 2, \dots$, we also have for $i = 2, \dots$
that

$$\beta_2 \Rightarrow D_1 \lor D_2 \lor \dots \lor D_k \lor e_i.$$

Thus the containment entailment can be equivalently written
as:

$$\beta_2 \Rightarrow (D_1 \lor D_2 \lor \dots \lor D_k \lor e_1)$$

Or as:

$$\beta_2 \land \neg e_1 \Rightarrow D_1 \lor D_2 \lor \dots \lor D_k \qquad (2)$$

Now, we call the above implication (2) a containment entail- 
ment although slightly abusively, since it does not come from 
any pair of CQAC queries. However, it has all the logical 
properties we used so far to rewrite equivalently the original 
containment entailment with k +1 disjuncts in the right hand 
side into the containment entailment in (2) with k disjuncts 
in the right hand side, i.e., with one disjunct less. Thus we 
can proceed in the same way on the containment entailment 
of (2) in order to write it equivalently as a containment en- 
tailment with k — 1 disjuncts in the right hand side, and so 
on. However, each time we reduce one disjunct, we add a 
conjunct, which is an AC, on the left hand side of the arrow. 
We only have polynomially many different ACs that we can 
construct from the variables and the constants we have in the 
original containment entailment. Moreover, the AC that we
transfer to the left hand side (e.g., $\neg e_1$) is different from
any other AC on the left hand side so far because we have
assumed that it is not directly implied by the left hand
side. Thus, after removing polynomially many disjuncts from
the original containment entailment, we have a containment
entailment where all the ACs on the right hand side are
directly implied by the conjunction on the left hand side. I.e.,
we have the following:


$$\beta_2 \land \neg e_1 \land \neg e_2 \land \dots \Rightarrow D_m$$


where (without loss of generality) the $e_j$'s come from the
original entailment from $D_{m+1}, D_{m+2}, \dots, D_{k-1}, D_k$.

Thus we have proved that the containment entailment is
true iff there are $D_m, D_{m+1}, \dots, D_k, D_{k+1}$ such that


$$\beta_2 \Rightarrow D_m \lor D_{m+1} \lor \dots \lor D_k \lor D_{k+1}$$


where $k - m$ is a positive integer which depends on the number
of constants and variables in the queries as a polynomial.
Now, the certificate that will help us prove the problem 
in NP consists of all the mappings necessary to make the 
containment entailment true. We have proved that we only
need polynomially many of those. We also have to argue
that we can prove in polynomial time that the containment 
entailment is true. For this, we use the Claim and Lemma 
A.2. We add to the certificate a total order on the mappings
$\mu_i$, $i = 1, 2, \dots$, that are given together with an AC from each
$\mu_i(\beta_1)$, which is the AC that is not directly implied from the
current lhs ACs, as we argued in the proof above about the 
polynomial number of mappings. Thus the total order and 
the special AC are as explained in this proof above. Now, it 
is easy to check in polynomial time that an AC is directly 
implied by the current lhs ACs. It is also easy to check in 
polynomial time whether a disjunction of two ACs is implied 
by the lhs ACs. In summary, we prove that the containment 
entailment is true by a careful book-keeping, i.e., in a certain 
order, which is given by the total order on the mappings and 
the special ACs. 


6 THE HOMOMORPHISM PROPERTY 
EXTENDED 


Many of the results about the homomorphism property carry 
over to other problems that are related to query contain- 
ment. In order to do that, however, we need to express the 
homomorphism property in a more general way than relating 
classes of queries. As we have seen in the theorems we stated
in Section 4, the CQAC classes $\mathcal{Q}_1$ and $\mathcal{Q}_2$ were described in
terms of what ACs are allowed in queries that belong to each
class. This proves appropriate for the problems we extend 
the HP to, and hence, we define the homomorphism property 
by giving the types of ACs allowed in each class of queries, 
as follows: 


Definition 6.1. We say that a pair of sets of arithmetic
comparison types $(B_1, B_2)$ enables the homomorphism property
if the homomorphism property holds from $\mathcal{Q}_1$ to $\mathcal{Q}_2$,
where $\mathcal{Q}_1$ is the class of CQAC queries with ACs from $B_1$, and
$\mathcal{Q}_2$ is the class of CQAC queries with ACs from $B_2$.


Thus, Theorem 4.2 says that the pair $(B_1, B_2)$ enables the
homomorphism property where $B_1$ is $\{var \leq const,\ var < const,\ var = var,\ var = const\}$
and $B_2$ is $T_{AC} - \{var \leq const\}$.


7 USE DOMAIN INFORMATION 


We begin with an example.
Example 7.1.
$Q_1$ : $q_1(Trans)$ :- $a(Trans, X, Y),\ b(Trans, X', Y'),\ X' < Y',\ X \neq Y$
$Q_2$ : $q_2(Trans)$ :- $a(Trans, X, Y),\ a(Trans1, Y, X),\ b(Trans, X', Y'),\ X' < Y',\ X \neq Y$


Relation $b$ stores the transaction ID in attribute $Trans$,
with the date of payment (second attribute) and the date of
delivery (third attribute). Relation $a$ stores the transaction ID
in attribute $Trans$, with the amount of prepayment (second attribute)
and the amount of payment upon delivery (third attribute).

Query $Q_1$ produces a report of the IDs of the transactions such
that the date of payment (variable $X'$) is before the date
of delivery (variable $Y'$), and the amount of prepayment
(variable $X$) and the amount of payment upon delivery
(variable $Y$) are not the same.

Query $Q_2$ is the same as $Q_1$, except that it requires that there
is another transaction $Trans1$ with the same amounts of
prepayment and payment upon delivery in reverse.

It is obvious that the comparisons $\neq$ and $<$ do not interact,
because the first refers to integers representing amounts of
dollars, whereas the other refers to dates, and it does not
make sense to compare amounts of money with dates.


Thus, the intuition is that a more efficient containment
test may be possible in such cases. We will formalize it now.


7.1 Definitions and Formal Presentation 


A connected component of an undirected graph is a maximal
subgraph such that there is a path in the subgraph connecting
any pair of nodes of the subgraph. A bridge of an
undirected graph is an edge which, when removed, increases
the number of connected components of the graph. We know that there
is an orientation of the edges of a connected undirected graph which
converts it to a strongly connected directed graph if
and only if the undirected graph has no bridges.

Let $Q_1$ and $Q_2$ be CQAC queries for which we want to
test whether $Q_2$ is contained in $Q_1$. The following definition
provides us with domain information:


Definition 7.2. We consider the containment entailment
for testing whether $Q_2$ is contained in $Q_1$. We form the
containment entailment inequality graph as follows: It is an
undirected graph with nodes the variables and constants 
appearing in the containment entailment. Moreover, there 
is an edge between two nodes if there is an AC in the con- 
tainment entailment connecting the corresponding variables 
or variable and constant. After removing all the bridges of 
this undirected graph we are left with a number of connected 
components that are pairwise disconnected. We call each of 
them a containment component or c-component. We call the 
set of nodes in each c-component a domain. 
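The construction in Definition 7.2 can be sketched in a few lines. The code below is a brute-force illustration for small graphs: an edge is reported as a bridge if deleting it increases the number of connected components. The graph at the end is a hypothetical inequality graph invented for this sketch, not one derived from the queries of Example 7.1, and all function names are assumptions.

```python
def components(nodes, edges):
    """Connected components of an undirected multigraph, as a list of sets."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def bridges(nodes, edges):
    """Indices of the edges whose removal increases the number of components."""
    base = len(components(nodes, edges))
    return [i for i in range(len(edges))
            if len(components(nodes, edges[:i] + edges[i + 1:])) > base]


def c_components(nodes, edges):
    """Components that remain after every bridge is removed."""
    cut = set(bridges(nodes, edges))
    kept = [e for i, e in enumerate(edges) if i not in cut]
    return components(nodes, kept)


# Hypothetical graph: two cycles of ACs joined by a single AC edge.
nodes = ["X", "Y", "Z", "U", "V", "W"]
edges = [("X", "Y"), ("Y", "Z"), ("Z", "X"),   # first cycle
         ("Z", "U"),                           # the only bridge
         ("U", "V"), ("V", "W"), ("W", "U")]   # second cycle
domains = c_components(nodes, edges)
print(bridges(nodes, edges))
print(domains)
```

Edges are kept in a list and removed by index, so two ACs over the same pair of terms count as parallel edges and neither of them is a bridge. A linear-time bridge algorithm would replace the brute-force loop for larger graphs.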


Definition 7.3. For each domain, we consider its c-component, 
and we define the containment sub-entailment to be the con- 
tainment entailment where we have dropped all ACs that do 
not involve both variables/constants from this c-component. 
We form as many containment sub-entailments as we have 
c-components. Similarly, for each containment implication, 
we define the containment sub-implication. 


Now we prove, in the following theorem, that the homomorphism
property for each domain implies the homomorphism property in
general.


THEOREM 7.4. Suppose that, for each domain, the ACs
in its c-component are such that in $Q_1$ the ACs are of types
from $B_1$, and in $Q_2$ the ACs are of types from $B_2$, and that
the pair of sets $(B_1, B_2)$ enables the homomorphism property.
Then the homomorphism property holds from $Q_1$ to $Q_2$.

PROOF. It is easy to see the following claim is true:

Claim: For each containment implication, the following 
is true: The containment implication is true iff there is a 
domain, so that the containment sub-implication (for this 
particular containment implication) is true for this domain. 

The proof of the claim uses the fact that the c-components
(which define domains) of the containment entailment
inequality graph form a tree. Hence, there are no cycles
involving more than one c-component that the algorithm
AC-sat can use.

If each domain has the HP, then for each containment sub- 
implication, there is a single AC on the right hand side which 
is implied by the left hand side. Now, we use Lemma 3.2 to 
conclude that there is a single disjunct in the containment 
entailment that is implied by the left hand side. 


8 REWRITING QUERIES USING 
VIEWS 


View is the name we use for queries when they define
precomputed data (this is one of the uses of views, but for
the purposes of this section we do not need to be broader
than that; for more, see, e.g., [4]). The problem of answering
queries using views [13] via rewriting is as follows: given a
query $Q$ on a database schema and views $V$ over the same
schema, can we answer the query using only the answers to
the views via a rewriting? I.e., can we find a query $P$ that uses
as base relations only the relations of the views, such that
for any database $D$ we have $Q(D) = P(V(D))$? We will
discuss here the problem of finding equivalent rewritings in
the language of unions of CQACs for CQAC queries and
views. First we define a contained rewriting and then use
this definition to define equivalent rewritings. We need the
following definition:


Definition 8.1. The expansion of a query $P$ using views $V$,
denoted by $P^{exp}$, is obtained from $P$ by replacing all the views
in $P$ with their corresponding base relations and comparisons
from their definitions. Non-distinguished variables in a view
are replaced with fresh variables in $P^{exp}$.


THEOREM 8.2. Given a query $Q$ and a view set $V$, a query
$P$ is a contained rewriting of query $Q$ using $V$ if $P$ uses only
the views in $V$, and $P^{exp} \sqsubseteq Q$. Given a rewriting language
$L$ (e.g., conjunctive queries with comparisons), we call $P$ an
equivalent rewriting of $Q$ using $V$ with respect to $L$ if $P$ is in
$L$, and $P^{exp} \equiv Q$.


THEOREM 8.3. For CQ queries and views, in the language 
of CQs contains a number of subgoals which is at most equal 
to the number of subgoals in the query. 


In the presence of ACs the picture changes though. We may 
have arbitrarily long contained rewritings as the following 
example shows (it gets intuition from Example 4.4). 


Example 8.4.
$Q_1$ :- $a(X, Y),\ X = 5,\ Y < 5$
$V_1$ : $v_1(X, Y)$ :- $a(X, Y),\ Y \leq 5,\ X = 5$
$V_3$ : $v_3(X, Y)$ :- $a(X, Y),\ Y < 5$
$V_2$ : $v_2(X, Y)$ :- $a(X, Y),\ X \leq 5,\ Y \leq 5$
This is a contained rewriting of the query $Q_1$ using the three
views:
$R$ :- $v_1(X_1, X_2),\ v_2(X_2, X_3),\ v_2(X_3, X_4),\ v_2(X_4, X_5),\ \dots,\ v_2(X_{n-2}, X_{n-1}),\ v_3(X_{n-1}, X_n)$

The intuition is the following: Because of the definitions of
the views, we know that variable $X_2$ in $R$ is either equal to
5 or less than 5. $X_1$ is equal to 5, and if $X_2$ is less than 5 then
we have found a mapping from the query to the expansion
of $R$. Otherwise $X_2$ is equal to 5, and thus if $X_3$ is less than
5 we have found a mapping. Otherwise $X_3$ is equal to 5, etc.,
up until we arrive (if we have to) at variable $X_{n-1}$, which is
equal to 5, and we know that $X_n$ is less than 5; hence we
have found a mapping.


We compute the canonical rewriting of a query using the 
views as follows: First, we freeze the variables of the query to 
distinct constants and we compute the views on the thus cre- 
ated database. Then we de-freeze back the constants to their 
corresponding variables. The view tuples computed form the 
body of the canonical rewriting. Technically, computing the 
views on the database with the frozen variables is equivalent 
to finding a homomorphism from the view's subgoals to the
query subgoals. Hence, we can derive the following theorem: 


THEOREM 8.5. Suppose query and views are CQs. Then, 
there is an equivalent rewriting in the language of CQs iff the 
canonical rewriting is such a rewriting. 
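The freeze, evaluate, and de-freeze procedure for CQs described before Theorem 8.5 can be sketched as follows. This is a brute-force illustration that assumes view bodies contain only variables; the query, the two views `v` and `w`, and all function names are hypothetical and chosen only for this example.

```python
from itertools import product


def homomorphisms(body, facts):
    """Yield every variable mapping that sends each atom of `body` to a fact."""
    variables = sorted({t for _, args in body for t in args})
    terms = sorted({t for _, args in facts for t in args})
    fact_set = set(facts)
    for image in product(terms, repeat=len(variables)):
        h = dict(zip(variables, image))
        if all((rel, tuple(h[a] for a in args)) in fact_set for rel, args in body):
            yield h


def canonical_rewriting(query_body, views):
    """View tuples obtained by evaluating each view on the frozen query body."""
    # Freezing: every query variable is treated as a distinct constant, so the
    # query atoms themselves serve as the facts of the canonical database.
    facts = [(rel, tuple(args)) for rel, args in query_body]
    result = set()
    for name, (head, body) in views.items():
        for h in homomorphisms(body, facts):
            # De-freezing is the identity, since constants keep the variable names.
            result.add((name, tuple(h[a] for a in head)))
    return sorted(result)


# Query body: a(X,Y), a(Y,Z)
query_body = [("a", ("X", "Y")), ("a", ("Y", "Z"))]
views = {
    # v(A,B) :- a(A,B)
    "v": (("A", "B"), [("a", ("A", "B"))]),
    # w(A,C) :- a(A,B), a(B,C)
    "w": (("A", "C"), [("a", ("A", "B")), ("a", ("B", "C"))]),
}
rewriting_body = canonical_rewriting(query_body, views)
print(rewriting_body)
```

Whether the resulting body is an equivalent rewriting still has to be checked by a containment test on its expansion; the sketch only produces the candidate.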


Theorem 8.5 can be extended to CQAC queries and views
only if the HP holds.


8.1 Homomorphism property for query 
rewriting 


Definition 8.6. We say that the pair of sets of arithmetic
comparison types $(B_1, B_2)$ enables the homomorphism property
for equivalent query rewriting if

a) $(B_1, B_2)$ enables the homomorphism property.

b) The views have types of ACs from $B_1$ and the query
has types of ACs from $B_2$.


We can define the canonical rewriting for the case there 
are arithmetic comparisons. We build it as in the case of CQs 
but we also have to satisfy the condition that the ACs in the 
query should imply the ACs in the views as they are mapped 
on variables of the query. Theorem 8.5 can be extended to 
CQAC queries and views when the HP holds: 


THEOREM 8.7. Consider query and views. The views have
types of ACs from $B_1$ and the query has types of ACs from
$B_2$. Suppose the pair of sets of arithmetic comparison types
$(B_1, B_2)$ enables the homomorphism property for equivalent
query rewriting. Then the following holds: If there is an
equivalent rewriting in the language of CQACs, then the canonical
rewriting is such a rewriting.


We present an example.

Example 8.8. We use the three views of Example 8.4 and
the query:
$Q$ :- $a(X, Y),\ X < 5,\ Y < 5$
The canonical database of $Q$ is $\{a(X, Y),\ X < 5,\ Y < 5\}$.
We compute the views on it and we construct the canonical
rewriting, enhanced with ACs appropriately:


$R_{can}$ :- $v_3(X, Y),\ v_2(X, Y),\ X < 5,\ Y < 5$.
Notice that $R_{can}$ is equivalent to $Q$.


8.2 MCRs and certain answers


When equivalent rewritings do not exist, then we find the next 
best thing which is maximally contained rewritings (MCRs). 
An MCR finds all “correct answers” (called certain answers). 
For CQs, i.e., for conjunctive queries without arithmetic com- 
parisons, there is an efficient algorithm for finding an MCR in 
the language of unions of CQs [14, 15, 17]. However, we can 
use the homomorphism property to extend this algorithm 
for special cases of CQACs too [2]. In this case we have the 
following definition: 


Definition 8.9. We say that the pair of sets of arithmetic 
comparison types (B1, B2) enables the homomorphism prop- 
erty for maximally contained query rewriting if 

a) (B1, B2) enables the homomorphism property. 

b) The query has types of ACs from $B_1$ and the views have
types of ACs from $B_2$.


For this section and the next section, see [4] and references
therein for an extensive exposition and related work.


9 THE CHASE ALGORITHM 


One way to view the chase algorithm is as generalizing the 
algorithm that computes the canonical rewriting. The chase 
algorithm considers tuple generating dependencies and equal- 
ity generating dependencies. View definitions can be turned 
into tuple generating dependencies in a straightforward way. 
Thus, there is an alternative way to find the certain answers 
(for definitions of tuple generating dependencies and the chase 
algorithm see [4]). We turn the view definitions to tuple gen- 
erating dependencies and apply the chase algorithm on the 
view instance. Then we compute the query on the result 
of the chase algorithm. Another problem where the chase 
algorithm is useful is when we check query containment
under dependencies. However, if we add arithmetic compar- 
isons to the tuple generating dependencies [5], then the chase 
algorithm does not work efficiently except in the case the 
homomorphism property holds for the tuple generating de- 
pendencies. We do not add details here, which can be found 
in [4]. However, we will explain informally on an example: 


Example 9.1. Consider the views and query in Example 8.8. 
The views can be written as tuple generating dependencies 
(tgd for short) as follows: 


$V_1$ : $a(X, Y),\ Y \leq 5,\ X = 5 \rightarrow v_1(X, Y)$
$V_3$ : $a(X, Y),\ Y < 5 \rightarrow v_3(X, Y)$
$V_2$ : $a(X, Y),\ X \leq 5,\ Y \leq 5 \rightarrow v_2(X, Y)$


Foto N. Afrati 


The canonical database of $Q$ is $\{a(X, Y),\ X < 5,\ Y < 5\}$.
The chase algorithm applied on $\{a(X, Y),\ X < 5,\ Y < 5\}$
works as follows. For each tgd it checks whether there is a
homomorphism from its left hand side to $\{a(X, Y),\ X < 5,\ Y < 5\}$
that satisfies the ACs. If there is, we add to
$\{a(X, Y),\ X < 5,\ Y < 5\}$ a copy of the right hand side of the
tgd, if there is not one already. Thus, we end up with
$\{a(X, Y),\ X < 5,\ Y < 5,\ v_3(X, Y),\ v_2(X, Y)\}$, which satisfies the given tgds
because, for any homomorphism from the left hand side of a
tgd to $\{a(X, Y),\ X < 5,\ Y < 5,\ v_3(X, Y),\ v_2(X, Y)\}$, there is
an extension of this homomorphism to a homomorphism from
the atoms of both sides of the tgd to this instance. Now,
the canonical rewriting can be formed by considering the
view atoms in the result of the chase and it is the same as in
Example 8.8.
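The chase step of this example can be sketched for the special case at hand: the instance has the single atom $a(X, Y)$, every tgd has the single body atom $a(X, Y)$, and all ACs compare one variable with a constant, so the only homomorphism to consider is the identity and AC implication reduces to comparing interval bounds. The view definitions are read here as $V_1$: $Y \leq 5, X = 5$; $V_2$: $X \leq 5, Y \leq 5$; $V_3$: $Y < 5$, which is an assumption of this sketch, as are all function names.

```python
from math import inf


def bounds(acs):
    """Tightest (lower, upper) bound per variable; each bound is (value, strict)."""
    b = {}
    for var, op, c in acs:
        lo, hi = b.get(var, ((-inf, False), (inf, False)))
        if op in ("<", "<=", "="):
            cand = (c, op == "<")
            # Smaller value is tighter; at equal value the strict bound is tighter.
            if (cand[0], not cand[1]) < (hi[0], not hi[1]):
                hi = cand
        if op in (">", ">=", "="):
            cand = (c, op == ">")
            # Larger value is tighter; at equal value the strict bound is tighter.
            if cand > lo:
                lo = cand
        b[var] = (lo, hi)
    return b


def implies(acs, ac):
    """Does the (consistent) conjunction `acs` imply the var-const comparison `ac`?"""
    var, op, c = ac
    (lo, lo_strict), (hi, hi_strict) = bounds(acs).get(
        var, ((-inf, False), (inf, False)))
    if op == "<":
        return hi < c or (hi == c and hi_strict)
    if op == "<=":
        return hi <= c
    if op == ">":
        return lo > c or (lo == c and lo_strict)
    if op == ">=":
        return lo >= c
    if op == "=":
        return lo == hi == c and not lo_strict and not hi_strict
    raise ValueError(f"unsupported operator: {op}")


# ACs of the canonical database {a(X,Y), X < 5, Y < 5}.
instance_acs = [("X", "<", 5), ("Y", "<", 5)]

# ACs on the left hand side of each tgd (assumed reading of the views).
tgds = {
    "v1": [("Y", "<=", 5), ("X", "=", 5)],
    "v2": [("X", "<=", 5), ("Y", "<=", 5)],
    "v3": [("Y", "<", 5)],
}

# A tgd fires when all of its ACs are implied by the ACs of the instance.
fired = sorted(name for name, body in tgds.items()
               if all(implies(instance_acs, ac) for ac in body))
print(fired)
```

Under these assumptions only the tgds for $v_2$ and $v_3$ fire, which agrees with the view atoms added by the chase in the example.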


The following theorem states the property of chase that
makes it useful:


THEOREM 9.2. Let $\Sigma$ be a set of tgds, and $D$ a database
instance that satisfies the dependencies in $\Sigma$. Suppose $K$ is a
database instance, such that there exists a homomorphism $h$
from $K$ to $D$. Let $K_\Sigma$ be the result of a successful finite chase
on $K$ with the set of dependencies $\Sigma$. Then the homomorphism
$h$ can be extended to a homomorphism $h'$ from $K_\Sigma$ to $D$.


10 CONCLUSIONS 


The main novel technical contributions of this paper are a) big 
components of the proof technique (this technique appeared 
in sketch in [3]) that leads to results about the homomorphism 
property (i.e., Sections 2.2, 3 and the technical lemmas in the 
Appendix), b) the definition and formal proof of the results 
about domain information (this also appeared in preliminary 
form in [3]) and c) the result in Section 5.1. The result in 
Section 5.1 extends one that appears in [2] where containment 
is tested via a transformation to Datalog programs. 

We believe we have arrived close to the boundaries for 
CQAC query containment problems as to whether the query 
containment problem has the homomorphism property. In 
that respect we list some open questions: 

1. Lemma A.1 in the Appendix also points in the direction
that, even in the case where $Q_1$ (we check whether $Q_2$ is contained in
$Q_1$) contains only LSIs, the problem may be $\Pi_2^p$-complete. The
indication for this belief is that, in containment implications,
sometimes even three ACs from the rhs are needed to prove
satisfaction. Another candidate problem to be $\Pi_2^p$-complete
is the following: when $Q_1$ contains two closed LSIs and two
closed RSIs.

2. The proof technique we present in detail here can be used
to check whether individual cases have
the homomorphism property and to find all such cases. E.g.,
we conjecture that the following case has the homomorphism
property: both queries use LSI arithmetic comparisons and
they both have no constants in ordinary relational subgoals.

3. Are there more cases besides the cases we presented
here where query containment for classes of CQACs is in
NP? Most importantly, do we need totally different proof
techniques than the ones presented here?

4. We have not mentioned what happens when we have
ACs of type $var \neq const$. In the following two examples more
than one mapping is needed to prove that $Q_2$ is contained in
$Q_1$, therefore the homomorphism property does not hold.


Example 10.1. Here we have $Q_1$ :- $a(X, Y),\ Y \neq 5$ and
$Q_2$ :- $a(W', W),\ a(Y, Z),\ Z < W$.


Example 10.2. Here we have $Q_1$ :- $a(X, Y),\ X \neq Y$ and
$Q_2$ :- $a(W', W),\ a(Y, Z),\ a(U, X),\ X < Y,\ Z < W,\ W' < U$.


However, we believe that it is not hard to prove that the
homomorphism property holds in the following case: query
$Q_1$ has only $\neq$ ACs and query $Q_2$ has only SI ACs.
A APPENDIX
A.1 Technical lemmas 


The lemmas in this appendix are all of the same flavor, in 
that they have the same proof technique, thus, they could 
be stated in a single lemma with a long statement. We state 
them separately for clarity. 


LEMMA A.1. Consider the following implication:

$$c_1 \land c_2 \land \dots \Rightarrow d_1 \lor d_2 \lor \dots$$

where the $c_i$'s and $d_i$'s are ACs and the conjunction of ACs
$c_1 \land c_2 \land \dots$ is consistent (i.e., it has a satisfying assignment
from the set of real numbers). Then the following is true:

Suppose all $c_i$'s are from the AC-family $T_{AC}$ (recall
Definition 4.1), i.e., any AC. Suppose the $d_i$'s are from the AC-family
$\{var \leq const,\ var < const,\ var = var,\ const = var\}$ (i.e.,
besides equality, we use only LSI ACs). Then the implication
is true iff one of the following happens:

(i) there is a single $d_i$ from the rhs such that

$$c_1 \land c_2 \land \dots \Rightarrow d_i$$

or

(ii) there are two ACs from the rhs, say $d_i$ and $d_j$, such
that

$$c_1 \land c_2 \land \dots \Rightarrow d_i \lor d_j$$

The case (ii) happens only if i) there is a constant shared a)
by $d_i$, b) by one of the $c_i$'s and c) by $d_j$, and ii) $d_i$ is an
open AC,

or

(iii) there are three ACs from the rhs, say $d_i$, $d_k$ and $d_j$,
such that

$$c_1 \land c_2 \land \dots \Rightarrow d_i \lor d_k \lor d_j$$

The case (iii) happens only if there are two LSIs from the rhs
and two LSIs from the lhs, all four sharing the same constant
and, in addition, there is a rhs AC of type $var = var$.

One case where (i) always happens is when we do not have
all of the following: a) a rhs open LSI and a rhs closed LSI
sharing the same variable and b) a lhs equality of either
type (i.e., either $var = var$ or $var = const$). Another case
where (i) always happens is when the lhs ACs do not include
any closed LSIs.


PROOF. Convention: We call the $d_i$'s the rhs ACs (for right
hand side ACs) and the $c_i$'s the lhs ACs (for left hand side
ACs).

In order to use the algorithm AC-sat, we first write the
implication as $\neg E$ where

$$E = c_1 \land c_2 \land \dots \land \neg d_1 \land \neg d_2 \land \dots$$

We consider the induced graph of the ACs in $E$ and we apply
the algorithm to prove that $E$ is false.
We consider the three cases of the algorithm AC-sat:
Case 1. Consider a strongly connected component with
two distinct constants $c_1$ and $c_2$. Without loss of generality,
suppose $c_1$ is adjacent to a rhs AC $d_i = c_1 \leq X$. From $X$,
there is a path to a constant $c' \neq c_1$ (which is either $c_2$, which
is $\neq c_1$ by our assumption, or another one) such that $c'$ is the
first constant on this path. Now, if $c' < c_1$ then the edge from
$c'$ to $c_1$ forms a cycle with only one rhs AC on it (because
all rhs ACs are related to a constant, since they are SIs). If
$c' > c_1$ then the edge from $c_1$ to $c'$ forms a cycle that does
not contain $d_i = c_1 \leq X$. Hence, we can proceed recursively
until we find a cycle with only one rhs AC on it.

Case 2. Consider a strongly connected component with
at least one edge (say $d_i = A_1 < A_2$) labeled by $<$. If this
component has two distinct constants we argue as in Case 1.
Otherwise, it should have exactly one constant because the $c_i$
ACs are not contradictory, hence at least one rhs AC should
be in this component. Consider arbitrarily one of those rhs
ACs, say $d_i = c_1 \leq X$ (or $d_i = c_1 < X$, whichever is the
case). There is a path from $X$ to $A_1$ and there is a path
from $A_2$ to $c_1$; moreover, these two paths do not contain any
rhs AC edge because such edges are adjacent to constants
(by definition) and we have assumed that there is only one
constant in this strongly connected component. Hence we
have created a cycle with an edge labeled by $<$ and with only
one rhs AC on it.

Case 3. Consider a strongly connected component with
exactly one constant $c$. All the rhs ACs/edges are adjacent
to this constant. Let $c \leq X$ be such an rhs AC. For any node
in this component there is a path to $c$ and a path from $X$,
and neither path contains a rhs AC/edge. Hence
a cycle is formed with only one rhs AC on it. Thus, if there
is a $\neq$ AC between two nodes $A_1$ and $A_2$ of this strongly
connected component, the set of contradictory ACs contains
at most two rhs ACs (one for each of $A_1$, $A_2$) and the $\neq$ AC
(which is the negation of a $d_i$ which is an $=$ AC).


As regards the above lemma, it is convenient to define 
a sufficient set for the implication in the statement of the 
lemma: A sufficient set of rhs ACs is one for which its elements 
are sufficient to prove the implication if only those elements 
remain on the right hand side of the arrow of the implication.
Thus the main conclusion of the lemma can be equivalently
stated as:

There is a sufficient set of cardinality at most three. 


LEMMA A.2. Consider the following implication:

$$c_1 \land c_2 \land \dots \Rightarrow d_1 \lor d_2 \lor \dots$$

where the conjunction of ACs $c_1 \land c_2 \land \dots$ is consistent (i.e., it
has a satisfying assignment from the set of real numbers) and
the $d_i$'s are all closed SI (i.e., either LSI or RSI) comparisons.
Then the implication is true iff one of the following happens:

(i) there is a single $d_i$ from the rhs such that

$$c_1 \land c_2 \land \dots \Rightarrow d_i$$

or

(ii) there are two ACs from the rhs, one of which is an LSI
and the other an RSI, say $d_i$ and $d_j$, such that

$$c_1 \land c_2 \land \dots \Rightarrow d_i \lor d_j.$$

The case (ii) happens only if the following is false: There
is a constant $c$ such that all constants in RSI ACs on the
right hand side are greater than $c$ and all constants in LSI
ACs on the right hand side are less than $c$.


PROOF. We form the induced directed graph as we did in 
the first lemma in this appendix, and we reason on this graph 
further. We use the algorithm AC-sat. Here only Case 2 
of the algorithm applies, i.e., there is a strongly connected 
component in the induced directed graph of the ACs with at 
least one rhs edge. We have two cases: either this strongly 
connected component has only LSI or only RSI rhs ACs, or it 
has both kinds. In the first case, we argue as in the proof 
of Lemma A.1, only we have fewer cases since in the present 
lemma we only consider closed ACs. 

For the second case, suppose a strongly connected component 
has two rhs ACs, $X \le a$ and $Y \ge b$, that are 
successive, i.e., there is a path from Y to X that uses 
only ACs from the left hand side of the implication. Then 
we consider the cycle that contains both. Then either of the 
following happens: a) the edge joining a and b forms a cycle 
which contains only $X \le a$ and $Y \ge b$ from the rhs, and 
thus we have proved our result; or b) the edge joining a 
and b forms a cycle that contains neither $X \le a$ nor $Y \ge b$, so 
we proceed recursively, considering now the new cycle, which 
contains fewer rhs ACs on it. (Remember that the new cycle 
cannot contain only lhs ACs because we have assumed that 
they are consistent.) 


In a similar way we can prove the following lemma for SIs 
on the right hand side; this lemma is not used in this paper 
but it constitutes an interesting observation and it concludes 
the case with SI comparisons in the containing query or, as 
in the lemma, on the right hand side of the implication. 


LEMMA A.3. Consider the following implication: 

$c_1 \wedge c_2 \wedge \dots \Rightarrow d_1 \vee d_2 \vee \dots$

where the $c_i$'s and $d_i$'s are ACs and the conjunction of ACs 
$c_1 \wedge c_2 \wedge \dots$ is consistent (i.e., it has a satisfying assignment 
from the set of real numbers). Then the following is true: 

Suppose all $c_i$'s are from the AC-family Tac, i.e., any AC. 
Suppose the $d_i$'s are from the AC-family {const < var, const < 
var, var ≠ var, var = var, const = var, const < var} (i.e., 
they use SI ACs). Then there is a sufficient set of cardinality 
at most five. 

One case where it is guaranteed that there is a sufficient 
set of cardinality one is when there is a constant c such that 
all constants in RSI are greater than c and all constants in 
LSI are less than c. 

One case where it is guaranteed that there is a sufficient 
set of cardinality at most two is when all the rhs are closed 
SI ACs. 


ABSTRACT 


During the last few decades the problem of community detection 
in social networks has become an important and challenging com- 
putational task. Consequently, a number of algorithms have been 
proposed in the relevant literature, some of which seem to solve 
the problem quite efficiently. The huge amount of data, however, 
calls for further improved techniques that can handle large and 
complicated networks. In this paper, we consider the effect of assigning 
weights to the edges of unweighted network graphs and estimate 
their importance in community detection. In particular, we propose 
a new edge weight function and study its effect when used as a 
preprocessing step for community detection algorithms. Experi- 
mental results on a benchmark of random networks confirm our 
intuition that assigning weights on edges can play an important 
role in improving the performance of such algorithms. 


CCS CONCEPTS 


• Networks → Network algorithms; Social media networks; 
• General and reference → Experimentation. 


KEYWORDS 


Community Detection, Social Networks, Neighborhood Overlap, 
Edge Betweenness, Modularity, Spanning Trees. 


ACM Reference Format: 

Dora Souliou, Petros Potikas, Katerina Potika, and Aris Pagourtzis. 2019. 
Weight assignment on edges towards improved community detection. In 
23rd International Database Engineering & Applications Symposium (IDEAS’19), 
June 10-12, 2019, Athens, Greece. ACM, New York, NY, USA, 5 pages. 
https://doi.org/10.1145/3331076.3331114 


All authors contributed equally to this work; the order of authors’ names is arbitrary. 



IDEAS’19, June 10-12, 2019, Athens, Greece 

© 2019 Association for Computing Machinery. 

ACM ISBN 978-1-4503-6249-8/19/06...$15.00 
https://doi.org/10.1145/3331076.3331114 


Petros Potikas 
ppotik@cs.ntua.gr 
School of Electrical and Computer Engineering 
National Technical University of Athens 
Zografou, Greece 


Aris Pagourtzis 
pagour@cs.ntua.gr 
School of Electrical and Computer Engineering 
National Technical University of Athens 
Zografou, Greece 


1 INTRODUCTION 


Different disciplines, such as computer science, society, economics, 
physics, and biology, model their complex data as networks. A net- 
work consists of nodes and edges connecting nodes, e.g., a social 
network consists of people (nodes) and relationships between peo- 
ple (edges). The rise of the Web and social media has created new 
challenges that require novel approaches and techniques. The problem 
we address here is that of assigning appropriate weights 
to the edges of a network in order to improve the modularity value 
achieved by certain known community detection algorithms. More 
specifically we assign weights on edges as a preprocessing step for 
the Louvain algorithm [1] as well as for the recently introduced 
ST algorithm [7]. We experimentally evaluate both approaches and 
compare them against the existing methods on benchmark net- 
works. In our first approach, we use the neighborhood overlap 
metric over the edge betweenness to assign weights on edges and 
then use these weights as an input to the weighted Louvain algo- 
rithm. In the second approach, we use the reciprocal fraction of the 
previous edge weight function and follow the minimum spanning 
tree-based approach of the ST algorithm [7]. 

Often, edges within the same community tend to have smaller 
edge betweenness centrality as compared to that of edges belonging 
to different communities. On the other hand, a small neighborhood 
overlap (nover) value of an edge indicates that its endpoints are 
likely to be in different communities. We therefore use the ratio of 
the two quantities as a better indication of the degree of relationship 
between two connected nodes. 


1.1 Terminology 


Some useful terminology follows. For a graph G(V, E), which mod- 
els a network, where V is the set of nodes (users), and E is the 
set of edges (connections between nodes), we define the following 
notions and measures. 

Bridge and Local Bridge: In [4] a bridge is defined as an edge 
between nodes A and B whose deletion would place the two endpoints 
A and B into two different components, i.e., that edge is the only way 
to connect A and B. A local bridge is defined as an edge between 
nodes A and B whose deletion would increase the distance 
between A and B to more than two, i.e., A and B have no common neighbors. 

Edge Betweenness. Edge betweenness (eb) of an edge $e \in E$ 
of a graph G defines how important that edge is with respect to 
shortest paths that connect each pair of nodes in that graph. More 
specifically, eb(e) sums, for all pairs of nodes i, j, the ratio of the 
number of shortest paths between i and j using edge e over the total 
number of shortest paths between i and j. One can assume that if 
much of the traffic of a network passes through an edge (assuming 
that traffic is routed through shortest paths) then this edge is more 
likely to connect different communities. 
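
The definition above can be made concrete with a short sketch. The code below is not the implementation used in the experiments; it is a plain-Python illustration for unweighted, undirected graphs that uses a Brandes-style accumulation (our choice for the sketch). The function name `edge_betweenness` and the adjacency-dictionary input are assumptions made for the example, and each unordered pair of nodes is counted once.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge betweenness of every edge of an unweighted, undirected graph.

    adj maps each node to an iterable of its neighbors (hypothetical input format).
    Returns a dict keyed by frozenset({u, v}).
    """
    eb = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist, sigma, preds = {s: 0}, {s: 1}, {s: []}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0
                    preds[w] = []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate path fractions from the farthest nodes towards s.
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                eb[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered pair (i, j) was visited from both endpoints.
    return {edge: value / 2.0 for edge, value in eb.items()}


# Toy graph (hypothetical): a triangle 0-1-2 with a pendant node 3 attached to 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
eb = edge_betweenness(adj)
print(eb[frozenset((2, 3))])  # the pendant edge carries the pairs (0,3), (1,3), (2,3)
```

In this toy graph the edge (2, 3) obtains the highest value, which agrees with the intuition that edges carrying much shortest-path traffic tend to connect different groups.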

Neighborhood Overlap: The neighborhood overlap (nover) of 
an edge (u, v) is the ratio of the number of common neighbors of 
both u and v to the number of nodes that are neighbors of either u or 
v. In other words, it is the embeddedness of the edge divided by the 
total number of neighbors of the two nodes connected by that edge: 

$$\mathrm{nover}(u, v) = \frac{|N_u \cap N_v|}{|N_u \cup N_v|} \qquad (1)$$

where $N_u$ denotes the set of neighbors of node u. 
If an edge is a local bridge then nover = 0. Hence, we can think of 
edges with very small nover value as being almost local bridges. 

Modularity: The quality of the partition of a graph into communities 
is usually measured using the modularity principle proposed 
by Girvan and Newman [9]. Modularity Q is a scalar value between 
−1 and 1 that compares the connectivity density of the nodes 
within the same community to the expected connectivity density 
of a graph with random edges on the same nodes. The larger the 
modularity score, the more appropriate is the partitioning of the 
nodes into communities. It is used to compare the communities 
obtained by different algorithms/methods. It is calculated as 

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \qquad (2)$$

where m is the number of edges, $A_{ij}$ is the weight of edge (i, j), $k_i$ 
is the degree of node i, $c_i$ is the community that i belongs to, and $\delta$ 
is a function with $\delta(u, v) = 1$ if u = v and 0 otherwise. 
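
To make the modularity formula concrete, the following sketch evaluates Equation (2) directly on a small graph. It is an illustration only, not the code used in the experiments: the function name `modularity`, the edge-list input format and the toy graph are assumptions, and m is taken as the total edge weight, which equals the number of edges when all weights are 1.

```python
def modularity(edges, community):
    """Evaluate Q of Equation (2) for an undirected graph without self-loops.

    edges     -- list of (u, v, weight) tuples, each undirected edge listed once
    community -- dict mapping every node to its community label
    """
    m = sum(w for _, _, w in edges)  # total weight; equals |E| for unit weights
    degree = {}
    for u, v, w in edges:
        degree[u] = degree.get(u, 0.0) + w
        degree[v] = degree.get(v, 0.0) + w

    total = 0.0
    # A_ij term: each intra-community edge appears as A_uv and as A_vu.
    for u, v, w in edges:
        if community[u] == community[v]:
            total += 2.0 * w
    # k_i * k_j / (2m) term over all ordered pairs (i, j) in the same community.
    for i in degree:
        for j in degree:
            if community[i] == community[j]:
                total -= degree[i] * degree[j] / (2.0 * m)
    return total / (2.0 * m)


# Toy graph (hypothetical): two triangles joined by the single edge (2, 3).
edges = [(0, 1, 1), (0, 2, 1), (1, 2, 1), (2, 3, 1),
         (3, 4, 1), (3, 5, 1), (4, 5, 1)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {v: "a" for v in range(6)}
print(modularity(edges, split))   # 5/14, about 0.357
print(modularity(edges, merged))  # 0.0 when all nodes form one community
```

Splitting the toy graph along the connecting edge gives a clearly positive score, while placing all nodes into one community gives zero, which is the baseline every partition is compared against.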


1.2 Outline of the paper 


The rest of the paper is organized as follows. Section 2 discusses 
the related work. Section 3 describes how weights are assigned 
on edges and the effect of this assignment on existing algorithms. 
Section 4 presents the experimental results and Section 5 concludes 
the paper. 


2 STATE-OF-THE-ART METHODS FOR 
COMMUNITY DETECTION 


Various approaches have been proposed in recent years to solve 
the community detection problem. Among them, some of the most 
studied ones are hierarchical methods that are either divisive or 
agglomerative. 

In their seminal paper, Girvan and Newman [6] define the 
eb centrality measure and propose the GN algorithm that uses this 
measure. GN iteratively removes edges of highest eb centrality, thus 
forming connected components that correspond to communities. 
The main disadvantage of this algorithm is that it is computationally 
expensive (since it recomputes the eb values for all edges in each 
step) and thus not scalable. The running time in the worst case is 
$O(|E|^2 |V|)$. 

Recently, the ST algorithm [7] was proposed, which follows a similar 
idea but uses a spanning tree in order to overcome the high 
computational cost of the GN algorithm; it is less time consuming 
while giving a reasonable modularity score. 

A two-phase algorithm for weighted graphs was proposed in [1], 
known as the Louvain algorithm, that runs in O(|E|) time. In the 
first phase, for each node it iteratively calculates the modularity 
obtained by including the node to the community of each of its 
neighbors, and then places this node into the community that gives 
the highest modularity. In the second phase, it creates a meta-graph 
in which communities are represented as meta-nodes and self- 
loops represent edges internal to the communities. The two phases 
are repeated on the meta-graph. This algorithm has a tendency 
to overlook small communities. In general, methods that use the 
modularity metric to optimize the community detection are known 
to suffer from the resolution limit effect [5], which refers to the 
fact that communities smaller than some threshold may not be 
discovered. Furthermore, the Louvain algorithm cannot efficiently 
explore the hierarchical structure of the network (if such a structure 
is present). 

The idea that the nover score, being a measure of the similarity 
of two neighbor nodes, may improve existing algorithms if used 
in a preprocessing phase is investigated in [7] with the so called 
nover Louvain algorithm, which indeed performed better in several 
random graphs. 

Like us, Yang et al. [11] use the Lancichinetti-Fortunato-Radicchi 
(LFR) benchmark graphs to test eight state-of-the-art algorithms 
in order to determine which algorithm performs well depending on the 
properties of the graphs and other criteria. Some of the algorithms are: the 
Fastgreedy algorithm [3], which is a greedy community detection 
algorithm that is based on optimizing the modularity score, and the 
Label propagation algorithm [10], which considers each node to 
belong to the same community as the majority of its neighbours. 


2.1 Our contribution 


Our approach uses the eb centrality in conjunction with the nover 
metric in order to assign weights on the edges of graphs before 
calling the suitable community detection algorithm. The eb metric 
is also used by the GN algorithm; however, in contrast to the GN 
approach, we compute it once and use it in conjunction with the 
nover weights in order to obtain a spanning tree, thus considerably 
reducing the eb computations. The results obtained with both modified 
algorithms show that in most cases they perform better than the 
original ones. Therefore, weight values seem worth taking into further consideration. 


3 ADDING EDGE WEIGHTS TO 
UNWEIGHTED GRAPHS 


3.1 Edge weights for the Louvain algorithm 


In Algorithm 1 (novel Louvain), each edge is first assigned the weight 
nover/eb; this preprocessing is the only modification 
with respect to the original Louvain algorithm. Note that 
this increases by at most an $O(\Delta)$ factor the time complexity of the 
algorithm, where $\Delta$ is the maximum degree of the network. This is 
because we need $O(|E|\Delta)$ time for computing the nover of all edges, 
since computing the common neighbors of an edge can be done in 
$O(\Delta)$ time. Combining with the $O(|E|)$ complexity of the original 
Louvain and the eb computation $O(|E||V|)$ ([2]), we get a total time 
complexity of $O(|E|(\Delta + |V|))$. 


Algorithm 1 Louvain community detection with edge weights 
using nover and eb (novel Louvain). 

Input: G(V, E) 
Output: Set of communities C of maximum modularity Q 
for each edge e = (u, v) ∈ E do 
    nover(e) ← |N_u ∩ N_v| / |N_u ∪ N_v| 
    compute eb(e) 
    w(e) ← nover(e) / eb(e) 
end for 
C ← Louvain(G, w) 
return C 
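
The preprocessing loop of Algorithm 1 can be sketched as follows. This is only an illustration under stated assumptions: the helper names `nover` and `nover_over_eb_weights` are hypothetical, neighbor sets are taken literally from the formula (so the endpoints themselves may appear in the union), and the eb values are supplied as input; the ones used below were worked out by hand for the toy graph. The call to a weighted Louvain implementation is left out.

```python
def nover(adj, u, v):
    """Neighborhood overlap |N_u ∩ N_v| / |N_u ∪ N_v| of the edge (u, v)."""
    common = adj[u] & adj[v]
    either = adj[u] | adj[v]
    return len(common) / len(either)


def nover_over_eb_weights(adj, eb):
    """Weight w(e) = nover(e) / eb(e) for every edge e, as in Algorithm 1.

    adj -- dict mapping each node to the set of its neighbors
    eb  -- dict mapping each edge (u, v) to its edge betweenness (assumed > 0)
    """
    return {(u, v): nover(adj, u, v) / b for (u, v), b in eb.items()}


# Toy graph (hypothetical): a triangle 0-1-2 with a pendant node 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
# Edge betweenness of this toy graph, counted over unordered node pairs.
eb = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 2.0, (2, 3): 3.0}

weights = nover_over_eb_weights(adj, eb)
for edge, w in sorted(weights.items()):
    print(edge, round(w, 3))
# The pendant edge (2, 3) is a local bridge, so nover = 0 and its weight is 0.
# Note that the reciprocal weight eb/nover would be undefined for such an edge;
# how that case is handled is not stated in the text.
```

Edges inside the triangle receive larger weights than the pendant edge, so a weighted community detection method is nudged towards keeping densely connected nodes together.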


3.2 Edge weights for the ST algorithm 


In Algorithm 2 (novelST), we first compute the weight of each edge 
as the fraction of eb over nover, i.e., the reciprocal of the weight 
used in Algorithm 1. In the next step, we compute the 
minimum spanning tree of the graph using the weights of the first 
step. In the third step, we sort the edges of the spanning tree in 
non-increasing order of their eb. Then, we iteratively remove the edges 
of the spanning tree one by one, starting from the first edge in the order. In each step 
of the iteration, we calculate the modularity of the communities that 
have been obtained and recalculate the edge betweenness in the rest 
of the tree. The spanning tree can be computed efficiently, 
also in parallel. Furthermore, edge betweenness is computed 
in the resulting tree, adding little to the overall execution time. 


4 EXPERIMENTAL RESULTS 


In this section, we apply our ideas to existing methods, namely 
Louvain [1] and ST [7], and evaluate the results. To accomplish 
this, we used synthetic networks produced by the benchmark of 
Lancichinetti et al. [8] (LFR benchmark). For our experiments we 
used the Python igraph library in order to modify the existing 
implementations of the Louvain and ST algorithms. 


4.1 LFR benchmark 


Our experiments use graphs with various numbers of nodes, ranging 
from 500 to 3000. The value of the average node degree is equal 
to 10 and the maximum degree is equal to 15. 

We summarize our results in Table 1 and Table 2. 

The results presented in Table 1 show that in most cases the 
novel Louvain modularity slightly exceeds the one obtained by 
(standard) Louvain, and in the remaining cases the scores are quite 
close. Regarding ST algorithms, the novel ST algorithm consistently 
outperforms the ST algorithm of [7]. These results indicate that 
using edge weights often affects positively the quality of the results. 
Moreover, the number of communities given by each algorithm is 
not always the same. In Louvain-based algorithms only in a single 
case we obtain the same number of communities, while in the 
remaining cases the novel Louvain outputs a structure of higher 
granularity. Notably, the largest modularity increase is obtained 
when the number of communities found by novel Louvain is clearly 
larger than the number found by Louvain; this matches the observation 
that Louvain fails to find small communities in large networks 
(resolution limit). 


Algorithm 2 Community detection by neighborhood overlap and 
minimum spanning tree (novelST). 

Input: G(V, E) 
Output: Set of communities C with maximum modularity Q 
for each edge e = (u, v) ∈ E do 
    compute eb(e) 
    nover(e) ← |N_u ∩ N_v| / |N_u ∪ N_v| 
    w(e) ← eb(e) / nover(e) 
end for 
G'(V, E') ← MinimumSpanningTree(G, w) 
for each e ∈ E' do 
    eb(e) ← edge betweenness of e in G' 
end for 
Initialize C ← {V}, Q ← modularity of C in G(V, E)    ▷ one community 
Sort all edges in E' in non-increasing order of eb(e) 
while E' is nonempty do 
    Remove the edge e of highest eb(e) from E'    ▷ next edge in sorted list of edges 
    C' ← community structure implied by E'    ▷ set of components, partitioning V 
    Q' ← modularity of C' in G(V, E)    ▷ modularity is w.r.t. the original graph 
    if Q' > Q then 
        Q ← Q' 
        C ← C' 
    end if 
end while 
return C 
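
The pruning phase of Algorithm 2 (novelST) can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the spanning tree is taken as given input, all function names are hypothetical, the graph is unweighted, edges are removed in the order fixed by the initial sort (as in the pseudocode), and modularity is computed in its per-community form, which is equivalent to the pairwise sum for unweighted graphs.

```python
def components(nodes, edges):
    """Label each node with a representative of its connected component."""
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    label = {}
    for s in nodes:
        if s in label:
            continue
        label[s] = s
        stack = [s]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in label:
                    label[y] = s
                    stack.append(y)
    return label


def modularity(graph_edges, label):
    """Modularity of a partition of an unweighted graph, summed per community."""
    m = len(graph_edges)
    inside, degree_sum = {}, {}
    for u, v in graph_edges:
        degree_sum[label[u]] = degree_sum.get(label[u], 0) + 1
        degree_sum[label[v]] = degree_sum.get(label[v], 0) + 1
        if label[u] == label[v]:
            inside[label[u]] = inside.get(label[u], 0) + 1
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


def tree_edge_betweenness(nodes, tree_edges):
    """On a spanning tree, eb(e) is the number of node pairs that e separates."""
    n = len(nodes)
    eb = {}
    for e in tree_edges:
        label = components(nodes, [f for f in tree_edges if f != e])
        side = sum(1 for v in nodes if label[v] == label[e[0]])
        eb[e] = side * (n - side)
    return eb


def prune_spanning_tree(nodes, graph_edges, tree_edges):
    """Remove tree edges by decreasing eb and keep the best partition found."""
    eb = tree_edge_betweenness(nodes, tree_edges)
    order = sorted(tree_edges, key=lambda e: eb[e], reverse=True)
    remaining = list(tree_edges)
    best_label = components(nodes, remaining)      # initially one community
    best_q = modularity(graph_edges, best_label)
    for e in order:
        remaining.remove(e)
        label = components(nodes, remaining)
        q = modularity(graph_edges, label)         # w.r.t. the original graph
        if q > best_q:
            best_q, best_label = q, label
    return best_label, best_q


# Toy input (hypothetical): two triangles joined by (2, 3) and one spanning tree.
nodes = [0, 1, 2, 3, 4, 5]
graph_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
tree_edges = [(0, 2), (1, 2), (2, 3), (3, 4), (3, 5)]
label, q = prune_spanning_tree(nodes, graph_edges, tree_edges)
print(label, q)
```

On this toy input the first edge removed is (2, 3), which separates the two triangles and yields the best modularity; every later removal only fragments the triangles further.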

The results obtained by the ST algorithms are summarized in 
Table 2 and show a clear superiority of the novel ST algorithm in 
all cases. It seems that the effect of adding nover/eb edge weights 
can be significant for MST-based community detection methods, 
whereas the effect on the Louvain technique requires further inves- 
tigation. Regarding the number of communities, we observe that 
novel ST usually finds fewer communities than the standard ST. 


4.2 Real-world example 


We also demonstrate the communities formed by the 
existing and the two proposed algorithms on the well-known Zachary’s 
Karate Club dataset [12], which is shown in Figure 1. 

The communities obtained by Louvain [1] 
and the ST algorithm of [7] on this dataset 
are presented in Figure 2 and Figure 4, respectively. Additionally, the 
communities obtained by our novel Louvain and novelST can be seen in 
Figure 3 and Figure 5. Louvain and novel 
Louvain both find four communities and have similar modularity, 
close to 0.41. ST finds three communities with modularity 0.37, 
and novelST finds two communities with a lower modularity of 
0.33. Note that the ground truth for this dataset is two communities. 


nodes | Louvain                  | novel Louvain 
      | communities | modularity | communities | modularity 
500   | 19          | 0.62       | 27          | 0.644 
600   | 10          | 0.569      | 10          | 0.567 
1500  | 33          | 0.656      | 38          | 0.662 
2000  | 40          | 0.666      | 42          | 0.67 
2500  | 40          | 0.671      | 58          | 0.668 
3000  | 49          | 0.680      | 74          | 0.683 


Table 1: Louvain method and Louvain method with weighted edges on the LFR benchmark with average degree 10, maximum 
degree 15 


nodes | ST                       | novel ST 
      | communities | modularity | communities | modularity 
500   | ji          | 0.484      | 5           | 0.532 
600   | 6           | 0.376      | 6           | 0.463 
750   | 12          | 0.496      | 6           | 0.516 
900   | 13          | 0.483      | 7           | 0.55 
1000  | 9           | 0.526      | 10          | 0.588 


Table 2: ST method and novel ST on the LFR benchmark with average degree 10, maximum degree 15 


Figure 1: Zachary’s Karate Club. 

Figure 2: Zachary’s Karate Club Louvain. 

Figure 4: Zachary’s Karate Club ST. 

Figure 5: Zachary’s Karate Club novelST. 


5 CONCLUSION 


In this paper, we propose the idea of adding edge weights on originally 
unweighted graphs in order to see whether we can improve 
known community detection algorithms. We consider two basic 
approaches. The first one is the well known Louvain algorithm [1] 
and the second is the ST algorithm [7]. From the experimental results 
we find out that assigning weights on edges can give increased 
modularity values, especially in the case of the ST algorithm. Furthermore, 
we notice that the structure of communities may also 
vary considerably. This is due not only to the weights assigned 
but also to the metric we use for estimating the network structure 
given by each algorithm, namely modularity. As a future work, we 
plan to investigate the importance of assigning various kinds of 
weights on edges in comparison to different community detection 
methods, possibly also using different evaluation measures. 


ABSTRACT 


Crime has been prevalent in our society for a very long time and it 
continues to be so even today. Currently, many cities have released 
crime-related data as part of an open data initiative. Using this as 
input, we can apply analytics to be able to predict and hopefully pre- 
vent crime in the future. In this work, we applied big data analytics 
to the San Francisco crime dataset, as collected by the San Francisco 
Police Department and available through the Open Data initiative. 
The main focus is to perform an in-depth analysis of the major 
types of crimes that occurred in the city, observe the trend over the 
years, and determine how various attributes contribute to specific 
crimes. Furthermore, we leverage the results of the exploratory data 
analysis to inform the data preprocessing process, prior to training 
various machine learning models for crime type prediction. More 
specifically, the model predicts the type of crime that will occur 
in each district of the city. We observe that the provided dataset is 
highly imbalanced; as a result, metrics used in previous research focus 
mainly on the majority class, disregarding the performance of the 
classifiers on minority classes. We propose a methodology to address 
this issue. The proposed model finds applications in resource 
allocation of law enforcement in a Smart City. 


1 INTRODUCTION 


The concept of smart cities encompasses several initiatives that are 
supported by modern technology and aim at improving the lives 
of the people living within the city in various domains like urban 
development, safety, energy and so on [19]. One of the factors 
that determine the quality of life in a city is the crime rate therein. 
Although modern cities might offer a lot of technological advance- 
ments, the basic requirement of citizens’ safety still remains an 
open problem [11]. 

Crime continues to be a threat to individuals and to our society 
and demands serious consideration if we aim at reducing the onset 
or the repercussions caused by it. Hundreds of crimes are recorded 
daily by the data officers working alongside the law enforcement 
authorities throughout the United States. Many cities have signed up to 
the Open Data initiative, thereby making this crime data accessible 
to the general public. The intention behind this initiative is increas- 
ing the citizens’ participation in decision-making and utilizing this 
data to uncover interesting and useful facts [7]. 

The city of San Francisco is one of many that have joined this 
Open Data initiative. The data scientists and engineers working 
alongside the San Francisco Police Department (SFPD) have recorded 
over 100,000 crime cases in the form of police complaints they have 
received [6]. With the help of this historical data, many patterns 
can be uncovered. This can help us predict crimes that may happen 
in the future and thereby help the city police better safeguard the 
population of the city. 

Motivated by the ideal scenario, where every citizen lives in a safe 
environment and neighborhood, we propose some methodologies as 
well as some initial results that might help the law enforcement of a 
city predict and tackle crime. We employ the crime data set reported 
by SFPD over a period of 15 years (2003 to 2018) and analyze it 
to identify the trends of crimes over the years and predict crimes 
that might happen in the future. Compared to previous work 
on the same data, our proposed data preprocessing 
methodology improves prediction for the highly imbalanced dataset 
[1, 4, 15, 23]. We should point out that, even though our proof-of-concept 
in this work employs the San Francisco crime dataset, a 
similar approach can be followed to analyze any city’s or region’s 
crime data, so we hope that our approach can help with crime 
prevention on a larger, national and international level. 


1.1 Problem Formulation 


The problem being tackled in this paper can be best explained in 
two distinct parts: 

We first perform exploratory data analysis to identify crime 
patterns by: 


• Utilizing the crime data set provided by the SFPD to observe existing 
patterns of crime throughout the city of San Francisco. 

• Determining the classes of crimes within different areas in 
the city, and analyzing the spread and impact of the crime. 

• Studying the crime spread in the city based on the geographical 
location of each crime, the possible areas of victimization 
on the streets, seasonal changes in the crime rate and the 
type, and the hourly variations in crime. 


In the second part, we employ machine learning to generate a 
prediction methodology to identify the type of crime that can take 
place in the city, at several levels: 


• Using the patterns of crime identified during the 
exploratory analysis part, we inform and improve the data 
pre-processing process. 

• Building a prediction model that treats this problem as a 
multiclass classification problem, by classifying new raw 
(unclassified) data into one of the crime categories (classes), 
thereby predicting the crime that can occur. 

• Addressing the problem of an imbalanced dataset, by introducing 
additional data preprocessing tasks aiming at improving 
the precision and recall for all classes (including 
the minority classes) of our data. This improves upon previous 
research on the same dataset. 
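
To illustrate why per-class precision and recall matter on an imbalanced dataset, consider the following minimal sketch. It is not part of the proposed methodology; the function name `per_class_precision_recall` and the two class labels are hypothetical, and the data are a made-up toy sample in which a classifier always predicts the majority class.

```python
from collections import Counter


def per_class_precision_recall(y_true, y_pred):
    """Return {class: (precision, recall)} for a multiclass prediction."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class p, but it was wrong
            fn[t] += 1   # true class t was missed
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        scores[c] = (precision, recall)
    return scores


# Toy sample (hypothetical labels): nine majority-class records, one minority record.
y_true = ["theft"] * 9 + ["arson"]
y_pred = ["theft"] * 10          # a classifier that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_class_precision_recall(y_true, y_pred)
print(accuracy)          # high overall accuracy
print(scores["theft"])   # good scores on the majority class
print(scores["arson"])   # zero precision and recall on the minority class
```

The overall accuracy looks good even though the minority class is never predicted correctly, which is why the evaluation here considers precision and recall for every class.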


For the exploratory data analysis, we employ various data analyt- 
ics tools, along with Spark for initial data preprocessing, to analyze 
the spread of the crime in the city, and find the crime classes. For 
the machine learning/prediction part, in order to build a prediction 
model, we build upon the first part and use different types of algo- 
rithms, such as K-Nearest Neighbor, Multi-class Logistic Regression, 
Decision Tree, Random Forest, and Naive Bayes. 

The rest of the paper is organized as follows: in Section 2 we 
present an overview of the related work; our design and implemen- 
tation details are presented in Section 3, while the results of our 
analysis and experimental evaluation are included in Section 4. We 
conclude with our plans for future work in Section 5. 


2 RELATED WORK 


Over the years, there have been many studies involving the use of 
predictive analytics to observe patterns in crime. Some of these techniques 
are more complex than others and involve the use of more 
than one data set. Most of the data sets used in these studies are 
taken from the Open Data initiative [7] supported by the government. 
In this section, we will study the various techniques used by 
different authors, which will help answer questions such as: what is 
the role of analytics in crime prediction, what techniques are used 
for data preprocessing, and which classification techniques 
have proved to be most efficient. 


2.1 Temporal and Spectral Analysis 


A lot of research in the area of crime analysis and prediction re- 
volves around the analysis of spatial and temporal data. The reason 
for this is fairly obvious as we are dealing with geographical data 
spread over the span of many years. 

The authors of [17] have studied the fluctuation of crime through- 
out the year to see if there exists a pattern with seasons. In their 
research, they have used the crime data from three different Cana- 
dian cities, focusing on property related crimes. According to their 
first hypothesis, the peaks in crime during certain time intervals 
can be distinctly observed in the case of cities where the seasons 
are more distinct. Their second hypothesis is that certain types of 
crimes will be more frequent in certain seasons because of their 
nature. They were able to validate their hypotheses using Ordinary 
Least Squares (OLS) Regression for Vancouver and Negative 
Binomial Regression for Ottawa. Since their research focused on 
crime seasonality, a quadratic relationship in the data was predicted. 
Crime peaks were observed in the Summer months as compared to 
Winter. 

In a similar study, the authors of [2] have analyzed the crime 
data of two US cities - Denver, CO and Los Angeles, CA and provide 
a comparison of the statistical analysis of the crimes in these cities. 
Their approach aims at finding relationships between various crim- 
inal entities as this would help in identifying crime hotspots. To 
increase the efficiency of prediction, various preprocessing tech- 
niques like dimensionality reduction and missing value handling 
were implemented. In the analysis, they compared the percentage 
of crime occurrence in both cities as opposed to the count of crimes. 
Certain common patterns were observed, such 
as the fact that Sunday had the lowest crime rate in both 
cities. Also, important findings such as the safest and the most 
notorious districts were noted. Decision Tree and Naive Bayes 
classifiers were used. 

L. Venturini et al. [22] have discovered spatio-temporal patterns 
in crime using spectral analysis. The goal is to observe seasonal 
patterns in crime and to verify whether these patterns exist for all the 
categories of crime or whether the patterns change with the type of crime. 
The temporal analysis thus performed highlights that the patterns 
not only change with the month but also with the type of crime. 
Hence, the authors of [22] rightly stress the fact that models built 
upon this data would need to account for this variation. They have 
used the Lomb-Scargle periodogram [18] to highlight the season- 
ality of the crime as it deals better with uneven or missing data. 
The AstroML Python package was used to achieve this. In their 
paper they have described in detail how every category of crime 
performs when the algorithm is applied to the data. Further, the 
authors suggest that researchers should focus on the monthly and 
weekly crime patterns. 


2.2 Prediction using Clustering and 
Classification techniques 


The authors of [20] have described a method to predict the type 
of crime which can occur based on the given location and time. 
Apart from using the data from the Portland Police Bureau (PPB), 
they have also included data such as ethnicity of the population, 
census data and so on, from other public sources to increase the 
accuracy of their results. Further, they have made sure that the data 
is balanced to avoid getting skewed results. The machine learning 
techniques that are applied are Support Vector Machine (SVM), 
Random Forest, Gradient Boosting Machines, and Neural Networks 
[20]. Before applying the machine learning techniques to predict 
the category of the crime, they have applied various preprocessing 
techniques such as data transformation, discretization, cleaning 
and reduction. Due to the large volume of data, the authors have 
sampled the data to less than 20,000 rows. They used two data 
sets to perform their experiments - one was with the demographic 
information used without alterations and in the second case, they 
used this data to predict the missing values in the original data 
set. In the first case, ensemble techniques such as Random Forest or 
Gradient Boosting worked best, while in the second case, SVM and 
Neural Networks showed promising results. 

Since a smart city should prioritize the safety of its 
citizens, the authors of [11] have designed a strategy to construct a 
network of clusters which can assign police patrol duties, based on 
the informational entropy. The idea is to find patrol locations within 
the city, such that the entropy is maximized. The reason for the need 
to maximize the entropy is that the entropy, in this case, is mapped 
to the variation in the clusters, i.e. more entropy means more cluster 
coverage [11]. The dataset used for the research is the Los Angeles 
County GIS Data. The data has around 42 different crime categories. 
Taking the help of a domain expert, the authors have assigned 
weights to these crimes based on the importance of the crime. 
Also, the geocode for each record is taken into consideration and 
the records that do not have a geocode are skipped. The 
authors in [11] maximize the entropy given by the equation 
$H_{c_i} = -p(c_i)\,\ln p(c_i)$. The probability 
$p(c_i)$ is defined as the ratio of the weight of the centroid of the 
crime to the weight of the system, plus the ratio of the quickest 
path between two centroids to the quickest path in the whole 
system. 
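
As a rough illustration, and assuming the entropy in [11] takes the usual Shannon form of one term -p ln p per cluster, the quantity to be maximized could be computed as in the sketch below. The function names and the example probabilities are hypothetical and are not taken from [11].

```python
import math


def entropy_term(p):
    """Entropy contribution -p * ln(p) of one cluster; taken as 0 when p == 0."""
    return 0.0 if p == 0 else -p * math.log(p)


def total_entropy(probabilities):
    """Sum the per-cluster entropy terms (assumed aggregation, for illustration)."""
    return sum(entropy_term(p) for p in probabilities)


# Evenly spread probability mass gives higher entropy than a concentrated one.
even = total_entropy([0.25, 0.25, 0.25, 0.25])
skewed = total_entropy([0.85, 0.05, 0.05, 0.05])
print(even > skewed)
```

Under this reading, patrol locations that spread probability mass more evenly over the clusters score higher, which matches the statement that more entropy means more cluster coverage.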

The authors of [9] have taken a unique approach towards crime 
classification where unstructured crime reports are classified into 
one of the many categories of crime using textual analysis and 
classification. For achieving this, the data from various sources, 
including but not limited to the databases which store information 
about traffic, criminal warrants of New Jersey (NJ) and criminal 
records from NJ Criminal History, was combined and preprocessed. 
As a part of the preprocessing activity, all the stop words, punc- 
tuations, case IDs, phone numbers and so on were removed from 
the data. Following this, document indexing is performed on the 
data to convert the text into its concise representation. In order to 
identify the topics or specific incident types from the concise repre- 
sentation, the authors used Latent Semantic Analysis (LSA). Next, 
the similarity between these topics was scored using the Topic 
Modeling technique, where a score closer to 1 indicates greater 
similarity to the topic; this was followed by Text Categorization. 
The classification methods used in this research were Support Vector 
Machines (SVM), Random Forests, Neural Networks, MAXENT 
(Maximum Entropy Classifier), and SLDA (Scaled Linear Discriminant 
Analysis). The authors observed that SVM consistently 
performed best among them. 


2.3 Hotspot Detection 


A crime hotspot is an area where the occurrence of crime is high as 
compared to other locations [8]. Many researchers have taken an 
interest in determining crime hotspots from the given dataset. The 
authors of [8] mainly discuss two approaches for detecting hotspots: 
circular and linear. The authors also discuss the fundamentals of 
Spatial Scan Statistics as a useful tool for hotspot detection. The 
results on the Chicago crime data set are also discussed in detail 
for both approaches. 


3 DESIGN AND IMPLEMENTATION 


The fundamental goal of this work is to build a model that can 
predict the crime category that is most likely to occur given a 
certain set of characteristics such as the time, location and month. 
We also use statistical and graphical analysis to 
determine which attributes contribute to the overall improvement 
in the Log Loss score. Our proof-of-concept application focuses on 
the San Francisco crime dataset. We process the data in parallel using 
Apache Spark, a big data tool which distributes the 
data over a cluster to achieve parallel processing. It has become 
popular in recent years [12]. 


3.1 Overview of the data set 


We used the San Francisco crime data set [7]. The data set consists 
of the following attributes: 


• IncidntNum: the incident number of the crime as recorded 
in the police logs; it is analogous to the row number, 

• Descript: brief description of the crime; it provides slightly 
more information than the Category field but is still quite 
limited, 

• DayOfWeek (Date): day of the week when the crime occurred: 
Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, 
Sunday (exact date of the crime), 

• PdDistrict: police district the crime occurred in; San Francisco 
has been divided into 10 police districts: Southern, Tenderloin, 
Mission, Central, Northern, Bayview, Richmond, Taraval, 
Ingleside, Park, 

• Resolution: resolution for the crime, one of these values: 
Arrested, Booked, None, 

• Address: street address of the crime, 

• X (Y): longitudinal (latitudinal) coordinates of the crime, 

• Location: a pair of coordinates, i.e. (X, Y), 

• PdId: a unique identifier for each complaint registered for 
database update or search operations, 

• Category: category of the crime; originally, there are 39 distinct 
values (such as Assault, Larceny/Theft, Prostitution, etc.); 
it is also the dependent variable we will try to predict 
for the test set. 


The data set has about 1.4 million rows and is approximately 
450 MB in size. It contains data from 2003 to February 2018. A 
snapshot of the actual data set is shown in Figure 1. 

San Francisco Police Complaints dataset (2003 - 2018) 

Attribute    Record 1                               Record 2 
IncidntNum   160919032                              160920976 
Category     VANDALISM                              ASSAULT 
Descript     MALICIOUS MISCHIEF, VANDAL             THREATS AGAINST LIFE 
DayOfWeek    Friday                                 Saturday 
Date         11/11/16                               11/12/16 
Time         7:00                                   2:58 
PdDistrict   MISSION                                CENTRAL 
Address      2600 Block of MASON ST                 FILLMORE ST / GEARY BL 
X            -122.4052518                           -122.4140032 
Y            37.751525                              37.8079695 
Resolution   NONE                                   ARREST, BOOKED 
Location     (37.75152495730467, -122.4052517658    (37.80796947292687, -122.4140031783 
PdId         16091903228160.00                      16092097619057.00 


Figure 1: Snapshot of the actual Data set 


3.2 Data Preprocessing 


For our preprocessing, we employ Apache Spark. This provides 
several advantages, especially in terms of distributed and parallel 
processing. It can also significantly decrease the processing time of 
such a huge volume of data. 

The implementation of the rest of the project has been done 
using Python and hence we have used the PySpark distribution of 
Spark for preprocessing. 

The data set is mostly complete with no null values. However, 
there are a few outliers which must be handled (see 3.2.1). The 
dataset provided a lot of potential to extract more meaningful in- 
formation from the existing columns. Hence, a few columns have 
been added or transformed to improve the score of the resulting 
prediction. The decision to add or transform columns has been 
taken by studying the graphical analysis which has been performed 
on the data prior to building a model. 


3.2.1 Data Cleaning. One of the primary steps of data cleaning 
is outlier detection. Using the longitude/latitude coordinates, we 
identify 196 outliers that fall outside the minimum boundary of San 
Francisco and filter them out. 
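
A minimal sketch of such a coordinate filter is given below in plain Python (the project itself uses PySpark). The bounding-box limits and the helper names are assumptions chosen for illustration; they are not the boundary values used in this work.

```python
# Hypothetical bounding box for San Francisco (assumed values, illustration only).
SF_BOUNDS = {"min_x": -122.52, "max_x": -122.35, "min_y": 37.70, "max_y": 37.84}


def inside_bounds(x, y, bounds=SF_BOUNDS):
    """Return True if the (longitude, latitude) pair lies inside the bounding box."""
    return (bounds["min_x"] <= x <= bounds["max_x"]
            and bounds["min_y"] <= y <= bounds["max_y"])


def drop_outliers(records, bounds=SF_BOUNDS):
    """Keep only records whose X (longitude) and Y (latitude) fall inside the box."""
    return [r for r in records if inside_bounds(r["X"], r["Y"], bounds)]


records = [
    {"IncidntNum": 1, "X": -122.4052518, "Y": 37.751525},  # inside the assumed box
    {"IncidntNum": 2, "X": -120.5, "Y": 90.0},             # clearly outside the box
]
cleaned = drop_outliers(records)
print(len(cleaned))
```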

The next step in data cleaning is taking care of incorrect or 
missing data. Although the data set does not contain Null or missing 
values, the Category column does contain a few values which 
have been incorrectly labeled, like the TREA category which should 
actually be TRESPASSING. 

There are 39 distinct categories in the data set. However, some 
of the categories are very similar to each other. For example, when 
the Category column contains values or keywords like INDECENT 
EXPOSURE or OBSCENE or DISORDERLY CONDUCT, we can group 
those together in one category PORNOGRAPHY/OBSCENE MAT. 
The decision on which categories should be clubbed together is 
taken by looking at the Description column of the data set which 
provides more information on what the corresponding Category 
column represents. The complete list is presented for reference in 
Table 1. 


3.2.2 Data Transformation. Data transformation is one of the most 
important data preprocessing techniques. Raw data is often stored 
in a form that becomes more informative once it is transformed. 
In this case, the main transformations performed are as follows: 
Extracting Information from Other Attributes: On taking a closer 
look at the Description column, it is observed that it contains a lot 
of useful information which has not been captured in the Category 

Description Containing                            New Category 
License, Traffic, Speeding, Driving               Traffic Violation 
Burglary Tools, Air Gun, Tear Gas, Weapon         Deadly Tool Possession 
Sex                                               Sexual Offenses 
Forgery, Fraud                                    Fraud/Counterfeiting 
Tobacco, Drug                                     Drug/narcotic 
Indecent Exposure, Obscene, Disorderly Conduct    Pornography/obscene Mat 
Harassing                                         Assault 
Influence Of Alcohol                              Drunkenness 


Table 1: Extracting Information from Description Column 


column. For example, although the Description column explains 
that the crime has something to do with WEAPON LAWS, the Cate- 
gory column has classified it under OTHER OFFENSES. This might 
cause us to miss out on significant information. Hence, we extract 
such information from the Description column and rename the 
categories in the Category column. The complete list is shown in 
Table 2 for reference. 


Original Category Containing         New Category 
Weapon Laws                          Deadly Tool Poss. 
BadCheck, Counterfeit., Embezzl.     Fraud/Counterfeiting 
Suspicious Occ                       Suspicious Person/act 
Warrants                             Warrant Issued 
Vandalism                            Arson 


Table 2: Combining Similar Categories 
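
The keyword-based regrouping listed in Table 1 could be sketched as follows. The rule order, the case-insensitive substring matching, and the function name `recategorize` are assumptions made for illustration; the source does not state how overlapping keywords are resolved.

```python
# Keyword groups and target categories follow the rows of Table 1.
# Rule order and case-insensitive substring matching are assumptions.
DESCRIPTION_RULES = [
    (("LICENSE", "TRAFFIC", "SPEEDING", "DRIVING"), "TRAFFIC VIOLATION"),
    (("BURGLARY TOOLS", "AIR GUN", "TEAR GAS", "WEAPON"), "DEADLY TOOL POSSESSION"),
    (("SEX",), "SEXUAL OFFENSES"),
    (("FORGERY", "FRAUD"), "FRAUD/COUNTERFEITING"),
    (("TOBACCO", "DRUG"), "DRUG/NARCOTIC"),
    (("INDECENT EXPOSURE", "OBSCENE", "DISORDERLY CONDUCT"), "PORNOGRAPHY/OBSCENE MAT"),
    (("HARASSING",), "ASSAULT"),
    (("INFLUENCE OF ALCOHOL",), "DRUNKENNESS"),
]


def recategorize(category, description):
    """Return the Table 1 category whose keyword occurs in the description."""
    text = description.upper()
    for keywords, new_category in DESCRIPTION_RULES:
        if any(keyword in text for keyword in keywords):
            return new_category
    return category  # no keyword matched: keep the original label


print(recategorize("OTHER OFFENSES", "Possession of burglary tools"))
```

Because the first matching rule wins in this sketch, a description containing keywords from several rows is assigned to the earliest row; a real pipeline would need an explicit priority among the rules.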


Feature Extraction: There exist several features like Address, 
Time, Date, X and Y which can be transformed into new features 
that hold more meaning as compared to the existing ones. Hence, 
all of these features have been used to generate new features and 
some of these old features have been eliminated. 

Address to BlockOrJunc: In its original form, the Address feature 
has a lot of distinct values. The exact address of a crime is 
unlikely to be repeated, so it is of little use in 
predicting the type of crime in 
the future. However, this column can be used to see if the crime 
occurred on a street corner/junction or on a block. We can also 
check whether certain types of crime occur 
more frequently on a street corner than on a block. To achieve 
this, we simply check whether '/' occurs in the address. If it does, 
the crime occurred on a corner and we return 1; otherwise it 
occurred on a block and we return 0. 
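
The check described above amounts to a one-line function. The sketch below uses a hypothetical function name and the two addresses from Figure 1 as examples.

```python
def block_or_junction(address):
    """Return 1 if the address is a street corner (contains '/'), else 0 for a block."""
    return 1 if "/" in address else 0


print(block_or_junction("FILLMORE ST / GEARY BL"))   # corner
print(block_or_junction("2600 Block of MASON ST"))   # block
```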

Time to Hour: The Time feature is in the Timestamp format. 
It would be interesting to observe patterns in crime by the hour. 
Hence the Hour field is extracted from the Time field. It is worth 
noting that if the minute part is greater than 40, for example 
12:42, then the hour is rounded up to 13; otherwise 
it stays at 12. 
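
A minimal sketch of this rounding rule follows. The function name and the assumed "H:MM" input format are illustrative, and wrapping 23:45 around to hour 0 is an assumption, since the text does not say how the last hour of the day is handled.

```python
def time_to_hour(time_str):
    """Map an 'H:MM' string to an hour, rounding up when the minute part exceeds 40."""
    hour_text, minute_text = time_str.strip().split(":")
    hour, minute = int(hour_text), int(minute_text)
    if minute > 40:
        hour = (hour + 1) % 24  # assumption: 23:45 wraps around to hour 0
    return hour


print(time_to_hour("12:42"))
print(time_to_hour("12:40"))
```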
Date to Season, Day, Year and Month: The Date field is a very 
important one for prediction. Using this single field, we are able to 
extract four features. Spark provides inbuilt methods to extract the 
Day, Month and Year from the Date and hence our script makes use 
of the same. After extracting the Month from the Date, we make 
use of this feature to extract the Season. 
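
The sketch below shows the same extraction in plain Python rather than Spark. The MM/DD/YY date format is inferred from the snapshot in Figure 1, and the month-to-season mapping (meteorological seasons) is an assumption, since the text does not state which mapping was used.

```python
from datetime import datetime

# Assumed month-to-season mapping (meteorological seasons, northern hemisphere).
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def date_features(date_str, fmt="%m/%d/%y"):
    """Extract Day, Month, Year and Season from a date string."""
    parsed = datetime.strptime(date_str, fmt)
    return {
        "Day": parsed.day,
        "Month": parsed.month,
        "Year": parsed.year,
        "Season": SEASON_BY_MONTH[parsed.month],
    }


features = date_features("11/11/16")
print(features)
```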

X and Y to Grid: The X and the Y coordinates provide the exact 
location of the crime. However, we can see some interesting patterns 
on dividing the entire San Francisco area into a 20 × 20 grid. This 
is inspired by the work of [15], who give specific details on the 
formula used for the generation of these 400 cells. 
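
One simple way to assign a grid cell is a uniform partition of a bounding box, sketched below. This is an illustrative assumption and not necessarily the formula of [15]; the function name, the `bounds` tuple layout, and the row-major cell numbering are all hypothetical.

```python
def grid_cell(x, y, bounds, n=20):
    """Return a cell index in [0, n*n) for a point, using a uniform n x n grid.

    bounds is (min_x, max_x, min_y, max_y) of the area being partitioned.
    """
    min_x, max_x, min_y, max_y = bounds
    # Fraction of the way across the box, scaled to n columns and n rows.
    col = int((x - min_x) / (max_x - min_x) * n)
    row = int((y - min_y) / (max_y - min_y) * n)
    # Points on the upper edge would land in column/row n; clamp them into the grid.
    col = min(max(col, 0), n - 1)
    row = min(max(row, 0), n - 1)
    return row * n + col  # row-major numbering of the 400 cells


unit_bounds = (0.0, 20.0, 0.0, 20.0)  # toy bounding box for the example
print(grid_cell(1.5, 0.5, unit_bounds))
```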


3.2.3 Data Reduction. As previously mentioned, there are 39 
categories of crime in the original data set. Some of them include 
labels like NON-CRIMINAL, RECOVERED VEHICLE and SECONDARY 
CODES. Since we are trying to predict the future occurrences of 
crimes, it is essential to have categories pertaining to actual criminal 
activities. However, the above labels do not provide any additional 
information to help us achieve our goal. Thus, these categories are 
completely filtered out from our data set. This reduces the number 
of rows from about 2.1 million to about 1.9 million after all the 
preprocessing. 


3.3 Classification Techniques 


Classification techniques are used to automatically put the data 
into one or more categories also known as classes. 

We focus on Pigeonhole Multiclass Classification algorithms. 
Multiclass Classification involves classifying the data into more 
than two classes. One of the most common types of Multiclass 
Classifiers [14] is the Pigeonhole Classifier, where every item is 
classified into only one of the many classes. Hence, for a given item, 
there can be only one output class assigned to it. Below, we briefly 
describe the classification techniques that we used in our analysis. 


(1) Naive Bayes classifier is a supervised learning algorithm 
which is based on Bayes' theorem. Bayes' theorem can be 
stated as $P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$, where 
P(A|B) is the conditional probability of A happening given 
that B is true (similarly for P(B|A)), and P(A) and P(B) are the 
individual probabilities of A and B happening independently. 
The Naive Bayes classifier simplifies the computation by 
introducing the "naive" assumption that all pairs of features 
are conditionally independent given the class. Although these 
classifiers are fairly simple, they tend to work very well in a 
large number of real world problems. 

(2) Decision Tree classifiers use decision trees to make a prediction 
about the value of a target variable. The decision 
trees are basically functions that successively determine the 
class that the input needs to be assigned. Using decision 
trees for prediction has many advantages. An input is tested 
against only specific subsets of the data, determined by the 
splitting criteria or decision functions. Another advantage 
is that we can use a feature selection algorithm in order to 
decide which features are worth considering for the decision 
tree classifier. The fewer the features, the more efficient the 
algorithm [21]. To construct a decision tree, generally a 
top-down approach is applied until some predefined stopping 
criterion is met. 

(3) Random Forest classifiers generate multiple decision trees 
on different sub-samples of the data while training, and then 
predict by aggregating the outputs of the individual trees 
(for example, by averaging them). This helps to control the 
over-fitting that might happen when a single decision tree is 
used, as that algorithm is biased towards always selecting the 
same root of the tree (the one that gives the lowest entropy 
after the split). 

To alleviate this problem, in Random Forests the split for each 
node is determined from a subset of the predictor variables 
which are randomly chosen at the given node [16]. 

(4) K-Nearest Neighbor (KNN) classifiers classify data into 
one of the many categories by taking a majority vote of 
its neighbors. The label is assigned depending on the most 
common of the categories among its neighbors. The number 
of neighbors to consider is a user-defined parameter K that 
is set after experimentation. 

(5) Multinomial Logistic Regression classifiers are a generalized 
version of Logistic Regression for multiclass problems like 
ours. The log odds of the output are modeled as a linear 
combination of the various predictor variables [5]. There are two 
variants of Multinomial Logistic Regression based on the 
nature of the distinct categories in the dependent variable: 
nominal and ordinal [10]. Multinomial regression is fitted using 
Maximum Likelihood Estimation (MLE). Logistic Regression 
is a discriminative classifier [13] (Ch. 7). This means 
that it tries to learn the model based on the observed data 
directly and makes fewer assumptions about the underlying 
distribution. 
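
To make the majority-vote idea behind the KNN classifier in item (4) concrete, the sketch below implements it from scratch on toy two-dimensional points. The function name, the Euclidean distance, and the toy data are assumptions for illustration; the experiments in this paper rely on library implementations, not on this code.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k nearest neighbors wins the vote.
    return Counter(nearest_labels).most_common(1)[0][0]


points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
labels = ["THEFT", "THEFT", "ASSAULT", "ASSAULT", "ASSAULT"]
print(knn_predict(points, labels, (0.2, 0.2), k=3))
```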


4 EXPERIMENTAL EVALUATION 


4.1 Exploratory Data Analysis 


We begin by exploring our data. Exploratory data analysis is the 
first step of any big data analytics process. Using graphs we can 
get useful and interesting insights into our data. This step will also 
help us make data preprocessing decisions, such as which features 
to include for predictions. Some of these graphs show interesting 
patterns in crime, which might not be apparent otherwise. 

Figure 2 shows the trend of the crime over the years in various 
districts (neighborhoods) of San Francisco. These are the Police 
Districts and each of them includes many other city districts. Looking at 
this graph, we can observe that the crime in SOUTHERN, CENTRAL 
and NORTHERN districts is on the rise. On the other hand, crimes 
in TENDERLOIN and INGLESIDE have declined over the years. 

Figure 3 shows how crime varies by hour of the day. 
We can observe a clear relationship between crime and the hour 
of the day. Generally, the crime rate is low in the early morning 
hours from around 3:00 AM to 6:30 AM and it rises to its peak in the 
evening rush hours, i.e., from 4:30 PM to 7:00 PM, and is generally 
high at night. However, it would be interesting to see whether this 
pattern is followed by all the different types of crime. For this, we 
plot graphs for the top four crimes that we found interesting. 

Figure 4 focuses on Theft/Larceny crimes per hour. It largely 
follows the trend of the previous graph. 

Figure 2: Rate of Crime per District by Year 

Figure 3: Rate of overall crime every Hour 

Figure 4: Rate of Theft/Larceny by the Hour 

However, the same pattern is not followed by other crime types. 
For example, as shown in Figure 5, which plots Prostitution crimes, 
there are clear areas where Prostitution is high compared to 
others, and we can also see that Prostitution is higher around 
midnight and in the late hours (as expected). However, it is 
also very high around 11:00 AM in the Central district, which is 
unusual and could be investigated further by the police department 
and law enforcement agents. 

Figure 5: Rate of Prostitution by the Hour 

Figure 6: Sum of Drugs/Narcotics cases per Year 


In Figure 6, we can see that Drug/narcotics-related crimes 
were highest in 2009, followed by 2008. Anyone even 
slightly familiar with San Francisco might mention that the 
Tenderloin is one of the most notorious districts in the city, with 
a high crime rate and, in particular, a high rate of drug- and 
narcotics-related crimes; this impression is supported by the numbers 
shown here. From Figure 6 we can see that the Tenderloin district 
had the highest number of drug-related crimes until 2009. However, 
in recent years, these crimes have seen a huge dip, going down 
by more than 50% since 2009. This might be because the 
SFPD has focused its efforts on fighting crime in this notoriously 
crime-prone neighborhood. 

Area charts are another useful way to study the growth or decline 
in the rate of crime. In Figure 7 we see a rise in 
the number of thefts over the years in most of the districts in San 
Francisco, except Tenderloin and Taraval. On the other hand, 
the area chart of Drugs and Narcotics in Figure 8 
shows a clear decrease in these crimes in San Francisco. 

Figure 7: Area of Theft/Larceny by the Year 

Figure 8: Area of Drugs/Narcotics by the Year 


4.2 Comparison with existing results 


As discussed previously, several researchers have also worked with 
the SF crime dataset. In this section, we provide a comparative 
analysis. 

Our data preprocessing results in a reduction in the number of 
rows in the dataset from 2.19 to 1.92 million. We split the dataset 
into training and test sets chronologically as follows: as training data we 
use data from 2003 to 2015, consisting of 1,636,217 rows; 
as test data we use data from 2016 to 2018, consisting of 
284,165 rows. 
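
The chronological split can be sketched as a simple filter on the extracted Year feature. The function name and the dictionary-based records below are hypothetical; the cut-off year follows the split described above.

```python
def chronological_split(records, first_test_year=2016):
    """Split records into training (before the cut-off year) and test (from it onward)."""
    train = [r for r in records if r["Year"] < first_test_year]
    test = [r for r in records if r["Year"] >= first_test_year]
    return train, test


records = [{"Year": 2003}, {"Year": 2015}, {"Year": 2016}, {"Year": 2018}]
train_set, test_set = chronological_split(records)
print(len(train_set), len(test_set))
```

Splitting by time rather than at random keeps every test record later than every training record, so the model is evaluated on data it could not have seen.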

Following the practice of researchers in related work, we use 
the Log Loss score to evaluate our models. This scoring metric 
penalizes false classifications. The lower the Log Loss score, the better 
the model. For a perfect classifier, the Log Loss score would be 
zero [3]. 

Mathematically, the Log Loss function is defined as follows: 

$$\text{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log p_{ij}$$ 

where N is the total number of samples, M is the number of distinct 
categories present in the output variable, $y_{ij}$ takes the value of 0 or 
1 indicating if the label j is the expected label for sample i, and $p_{ij}$ 
is the probability that label j will be assigned to the sample i [3]. 
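
A minimal sketch of this metric in plain Python is shown below. Since the indicator is 1 only for the expected label, only the probability assigned to the true class contributes to each sample's term. The function name and the clipping constant `eps`, used to avoid taking the logarithm of zero, are assumptions and not part of the definition above.

```python
import math


def log_loss(y_true, probabilities, eps=1e-15):
    """Multiclass Log Loss.

    y_true: list of true class indices, one per sample.
    probabilities: list of per-class probability lists, one per sample.
    """
    total = 0.0
    for true_index, row in zip(y_true, probabilities):
        # Clip the probability of the true class to avoid log(0).
        p = min(max(row[true_index], eps), 1.0 - eps)
        total += math.log(p)
    return -total / len(y_true)


# A classifier that is maximally unsure between two classes scores ln(2).
print(log_loss([0, 1], [[0.5, 0.5], [0.5, 0.5]]))
```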

While we cannot directly compare the results, as we cannot 
know how the other researchers split their datasets, we can gauge 
how well our approach performs compared to that of previous 
work, and also provide some insights on which classifier seems to 
work best for this particular dataset. We first review the features 
used for building the various models in other existing approaches 
and their reported results. 

Source 1 [4]: The authors have used the features DayOfWeek, 
PdDistrict, X, Y, Month, Year, Hour and Grid (of 8 × 8) for the prediction. 
Their best model is Random Forest with LogLoss = 2.496, with 
second best the Decision Tree with LogLoss = 2.508. The authors 
also evaluated Naive Bayes and Logistic Regression. 

Source 2 [1]: The authors have used Hour, Month, District, 
DayOfWeek, X, Y, Street No., Block and 3 components of PCA. Their best 
model is Random Forest with LogLoss = 2.366, with second best the 
KNN classifier with LogLoss = 2.621. The authors also evaluated 
Naive Bayes. 

Source 3 [23]: The attributes/features used for prediction in 
this work are Year, Month, Hour, DayOfWeek, PdDistrict, X, Y and 
Block/Junction. Their best model is Logistic Regression with 
LogLoss = 2.45. The authors also evaluated KNN, which yielded very 
high log loss. 

Source 4 [15]: The features used for prediction are Hour, 
DayOfWeek, Month, Year, PdDistrict, Season, BlockOrJunction, 
CrimeRepeatOrNot, Cell and 39-d Vector. The authors only evaluated 
Logistic Regression, with LogLoss = 2.365. 

As shown in Table 3, in our approach, Random Forest is also the 
best model. In terms of Log Loss, our model yields the best result 
among those reported in related work, with LogLoss = 2.276, while 
our second best model is the Decision Tree (LogLoss = 2.3928). 

One important aspect left out by the previous papers focusing on 
crime classification in San Francisco is the issue of data imbalance. 
The data set is highly skewed, as shown by the per-category 
counts in Figure 9. We discuss how we address this problem in 
what follows. 


4.3 Improving Classification of Imbalanced 
Datasets 


Most of the existing work uses accuracy or Log Loss score to evalu- 
ate the efficiency of the model. However, these metrics provide an 
overall assessment of the classifier, without focusing on how well 

Algorithm              Log Loss 
Random Forest          2.2760 
Naive Bayes            2.5008 
Logistic Regression    2.4042 
KNN                    2.4634 
Decision Tree          2.3928 


Table 3: Results of Experiments (Log Loss) 


Figure 9: Count of Distinct Categories in the Dataset (x-axis: Category New; y-axis: Count of Category New, 0K to 400K) 


the classifier does for each class. Accuracy measures the percentage 
of correct predictions over all predictions, so even if the classifiers 
don't work well with minority classes, accuracy can still be very 
high. The Log Loss metric does discriminate among different classes; 
however, it weighs each type of misclassification equally. Again, a 
similarly misleading result might be obtained if the classifier works 
well for the majority classes (which is most often the case, as it is 
trained using such an imbalanced dataset). Instead, we need the 
model to correctly identify as many samples as possible while 
ensuring that those correctly identified samples include the minority 
classes as well; in other words, we want to increase precision and 
recall for each and every class in the model. 
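
The following sketch illustrates the point on a toy example: a classifier that always predicts the majority class reaches 90% accuracy while its recall on the minority class is zero. The function name and the toy labels are hypothetical and are not results from our experiments.

```python
from collections import Counter


def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} computed separately for each class."""
    true_pos = Counter()
    predicted = Counter(y_pred)  # how often each label was predicted
    actual = Counter(y_true)     # how often each label actually occurs
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            true_pos[truth] += 1
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / actual[label] if actual[label] else 0.0
        report[label] = (precision, recall)
    return report


# Toy imbalanced data: the classifier always predicts the majority class.
y_true = ["THEFT"] * 9 + ["ARSON"]
y_pred = ["THEFT"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report = per_class_precision_recall(y_true, y_pred)
print(accuracy, report)
```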

Looking at the SF crime data, we observe that even after 
preprocessing, the dataset is imbalanced, with the LARCENY/THEFT 
category acting as the majority class. We tried three techniques to 
handle the imbalance: oversampling the minority classes, 
undersampling the majority class, and adjusting weights on the classifiers. 
However, none of them showed a significant improvement in the 
Recall or Precision scores. Hence, the following preprocessing was 
performed in addition to the approaches described previously: 

(1) The LARCENY/THEFT category was split taking into consideration 
the Description column. It was observed that separating 
out the samples with Grand Theft From Auto in their 
description proved to be a good split. The resulting classes 
were LARCENY/THEFT and THEFT FROM AUTO. 

(2) Classes with fewer than 2000 samples were combined into the 
OTHER OFFENSES category. 

(3) A new category called VIOLENT/PHYSICAL CRIME was created, 
which includes the former categories of ARSON, WEAPON LAWS 
and VANDALISM, and instances of ROBBERY where physical 
harm or guns were involved. 
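
Step (2) above can be sketched as a frequency-based relabeling. The function name is hypothetical, and the example uses a small threshold on toy labels purely to keep the demonstration short; the threshold used in this work is 2000 samples.

```python
from collections import Counter


def merge_rare_classes(labels, min_count=2000, other_label="OTHER OFFENSES"):
    """Relabel every class with fewer than `min_count` samples as `other_label`."""
    counts = Counter(labels)
    return [label if counts[label] >= min_count else other_label for label in labels]


toy_labels = ["ASSAULT", "ASSAULT", "ASSAULT", "GAMBLING", "BRIBERY"]
merged = merge_rare_classes(toy_labels, min_count=2)  # toy threshold for the example
print(merged)
```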


This made the dataset more balanced (see Figure 10) than the 
original set. We can observe this by comparing the recall of the 
model for the original (Figure 11) and the balanced (Figure 12) 
dataset. 
Figure 10: More Balanced data set 

Figure 11: Recall for imbalanced dataset 


5 CONCLUSION AND FUTURE WORK 


In this work, we conducted a detailed analysis of the Open Data set 
of crime activity over 15 years for the city of San Francisco. We per- 
formed exploratory data analysis and extensive data preprocessing. 
Compared to previous work, we tried to alleviate the problem of an 
imbalanced dataset in order to improve the results of multi-class 
Figure 12: Recall for the more balanced dataset 


classification. As a part of the future work, we plan to evaluate 
how other classifiers, such as neural networks, can be employed 
to further improve the results of the classification process. We also 
plan to enhance this dataset with additional metadata, such as pop- 
ulation, housing and transportation data to gain more insights on 
the crime prediction process. Finally, we stress that the proposed 
approach can be applied to other cities' crime datasets to 
examine the similarities and differences between regions. 
Classification of eye-state using EEG recordings: speed-up gains 
using signal epochs and mutual information measure 


Phoebe M Asquith 
Cardiff University 
Cardiff, UK 
asquithpm@cardiff.ac.uk 


ABSTRACT 


The classification of electroencephalography (EEG) signals is useful 
in a wide range of applications such as seizure detection/prediction, 
motor imagery classification, emotion classification and drug effects 
diagnosis, amongst others. With the large number of EEG channels 
acquired, it has become vital that efficient data-reduction methods 
are developed, with varying importance from one application to 
another. It is also important that online classification is achieved 
during EEG recording for many applications, to monitor changes as 
they happen. In this paper we introduce a method based on Mutual 
Information (MI), for channel selection. Obtained results show that 
whilst there is a penalty on classification accuracy scores, promising 
speed-up gains can be achieved using MI techniques. Using MI with 
signal epochs (3secs) containing signal transitions enhances these 
speed-up gains. This work is exploratory and we suggest further 
research to be carried out for validation and development. Benefits 
to improving classification speed include improving application in 
clinical or educational settings. 


CCS CONCEPTS 


• Mathematics of computing → Graph theory; Time series 
analysis; • Applied computing → Psychology; • Hardware 
→ Sensor devices and platforms. 
electroencephalogram (EEG) analysis, eye-blink detection, time 
series analysis, graph theory applications, psychology. mutual in- 


1 INTRODUCTION 


Since its invention in 1929 [5], the electroencephalogram (EEG) has 
allowed the recording and interpretation of the electro-magnetic 
activity of neurons, from the scalp. Research using this technology 
has allowed crucial insights into the sleep wake cycle (e.g. [8]), 
neuropsychological abnormality (e.g. [17]), functional networks in 
the brain (e.g. [7]) and neural development (e.g. [6]). 

Recently, identifying eye-state using EEG has become of interest 
with findings that eye-state behavior such as blink frequency can 
demonstrate stress response [10] or an underlying neuropsycho- 
logical problem [16]. EEG signal changes related to eye-state have 
often been identified by separating raw data into different frequency 
bands [18]. However, this does not allow for online classification 
of eye-state. 

More recently the use of portable EEGs has become more preva- 
lent, with the development of innovative technologies (see Fig. 1). 
Research has demonstrated that with use of portable headsets, the 
eye-state of a participant can be identified using the raw time-series 
recorded at different channels, rather than separating data into dif- 
ferent frequency bands [19]. Despite some concerns around the 
measurement capabilities of the headsets, the potential of portable 
devices in current and future research is recognized within the field 
(e.g. see [21] for review in educational research). Portable EEGs are 
easier to implement than traditional EEGs and can be used with 
subjects "in the field" or who may have difficulty sitting still (e.g. 
young children). Having online eye-state classification capabilities 
with this portable technology is an exciting step towards a dynamic 
resource in cognitive-neuroscientific research. 


Figure 1: Example of portable EEG use with children. 


The application of machine learning methods for the classifica- 
tion of EEG signals has been widely explored in the last two decades. 
Examples include methods for feature selection and optimization 
as in [2] and channel selection as in [1, 22], amongst others (see 
[15] for a wide range of machine learning methods applied). 

In previous work, EEG signals have been used to classify eye-state
relatively successfully using Incremental Attribute Learning (IAL)
with extended time-series [19]. Epochs of ten seconds have also been
adequate for identifying drowsiness from eye-state [23]. However, to
be useful as an online classifier, a shorter snapshot of data must be
used to identify eye-state rather than an extended time-series, to
reduce calculation time and processing power. This is also important
for identifying blinks, which typically last 100-400 ms [14]. Levy [13]
explored the effect of epoch length on signal analysis of the EEG and
found that epochs as short as 2 seconds could be used for
intraoperative EEG monitoring. For eye-state classification in
particular, it has been demonstrated that a snapshot of the EEG signal
time-series in the alpha frequency range can be used to identify
eye-state, rather than an extended time-series [3].

In this research, we provide experimental analysis for sample 
size reduction based on a method to capture signals in discrete 
EEG signal slices compared to longer EEG signal time-series. Ad- 
ditionally, we investigate the effect of possible signal redundancy 
on classification scores and computational performance. This will 
be investigated using the raw EEG data rather than splitting it into 
different frequency bands, therefore eliminating data preparation 
steps. 

Results show that with both channel selection and sample reduction
methods, we could accomplish classification results with KNN, Logistic
Regression, Support Vector Machines (classifier: SVC) and RF that are
comparable to those obtained when run on the entire dataset containing
signals from all channels. Additionally, outcomes suggest that
significant computational speed-up could be achieved using a Mutual
Information (MI) measure for EEG channel selection.


2 DATASET 


The data corpus explored in this work was collected and compiled by
Roesler [15], and provided for open access on the UCI data
repository [12]. The dataset comprises raw electro-magnetic
recordings taken from the scalp of one participant and information
about eye-state (eyes open or closed) over the same time period.
The participant was asked to relax, look forwards towards a camera
and blink naturally, without restriction [9]. While the participant
looked toward the camera, a video of the eye was recorded. Once
recorded, the video data was coded: binary labels were used to
identify the two different eye-states, '1' for an "eye-blink" and '0'
for an "eye-open" state. The distribution of eye-states over the
course of the recording in the dataset is shown in Fig. 2.

During this time, recordings were also taken at the scalp using the
Emotiv EEG Neuroheadset, which measured the electro-magnetic signal
at 14 electrode positions (see Fig. 3; note that two electrode
positions were excluded, as indicated in the figure). 14980 sequential
timepoints (observations) were recorded from each of the 14 EEG
channels (features). The recording took place over a 117-second period
(a sampling rate of 128 Hz), and measured signals were stored as
floating-point values.

Figure 2: Eye-state distribution of the recorded observations.

Figure 3: The 14 EEG channels compiled by Roesler [15]. Excluded
channels are circled in red.

Initial exploration of the dataset indicated three outliers (values
much greater than 10x the average recording), which were removed.
Observations were therefore reduced to 14977 at each electrode; each
of these 14 time-series represented the signal variability of an
electrode over the experimental period. The time-series were then
normalised using zero centring ("de-meaning") to explore the positive
and negative deviation from their mean, as an indicator of signal
similarity; the centred signals are shown in Fig. 4. By eye, the
time-series show overall similarity across the different electrodes.
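The two preprocessing steps described above (outlier removal and zero centring) can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the per-channel median is assumed as the "average recording", and the factor of 10 is taken from the text as a simple threshold.

```python
from statistics import mean, median


def remove_outliers(rows, factor=10.0):
    """Drop observations where any channel exceeds `factor` times that
    channel's typical magnitude (assumption: typical = median of |value|)."""
    n_channels = len(rows[0])
    scale = [median(abs(r[c]) for r in rows) for c in range(n_channels)]
    return [r for r in rows
            if all(abs(r[c]) <= factor * scale[c] for c in range(n_channels))]


def zero_centre(rows):
    """Subtract each channel's mean so signals show deviation from the mean."""
    n_channels = len(rows[0])
    mu = [mean(r[c] for r in rows) for c in range(n_channels)]
    return [[r[c] - mu[c] for c in range(n_channels)] for r in rows]


# Toy data: 4 observations of 2 channels; the last row is an obvious outlier.
rows = [[4300.0, 4000.0], [4310.0, 4010.0], [4290.0, 3990.0], [700000.0, 4005.0]]
clean = remove_outliers(rows)
centred = zero_centre(clean)
```

After centring, every channel sums to zero, so positive and negative deviations can be compared across electrodes directly.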

For example, a strong signal similarity can be observed when 
looking at the time-series of AF3 and F7 as in Fig. 5. 

Similarities across the EEG time-series are observable overall.
Indeed, if we separate all EEG time-series relative to eye-state and
cross-correlate, topological patterns in signal variability across the
channels exist, see Fig. 6 (note that the order of variables varies
across the matrices to facilitate visualisation of possible
similarities). The correlation between electrode signals is also seen
to change during the eyes-closed state compared to eyes-open.

Figure 4: (a) Sliced window of the 14 EEG channel centered
time-series; um is signal voltage. Signals show similarity across
the EEG channels. (b) Only one channel (AF3) signal is shown.

Figure 5: Signal time-series for AF3 and F7; (b) shows a smaller
time-slot of the same time-series.

Figure 6: Correlation matrices for an 'eye-blink' time-series in (a),
and 'eye-open' in (b). Hierarchical clustering [11] is applied to
cluster higher-correlated channels together. Note the difference of
order in each matrix.
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3 METHODS AND FINDINGS 


Patterns between EEG channel signals relative to eye-state could
be further explored by techniques from graph theory. The linear
correlation between the time series $T_i(t_k)$ and $T_j(t_k)$ (the
Pearson correlation coefficient $R_{ij}$), given by

$$R_{ij} = \frac{\sum_{k=1}^{L} T_i(t_k)\, T_j(t_k)}{\sqrt{\sum_{k=1}^{L} T_i^2(t_k) \sum_{k=1}^{L} T_j^2(t_k)}} \qquad (1)$$

is widely used [20], whereby strong linearity between two channels
can be expressed as a link between two graph nodes. Having derived
the correlation matrix $C$, a threshold $\tau$ is usually applied to
define strong similarities between graph nodes as 'links'. The
adjacency matrix $A$ for the graph is then found by

$$A_{ij} = A_{ji} = \Theta(C_{ij} - \tau) - \delta_{ij}, \qquad (2)$$

where $\Theta$ is the Heaviside function and $\delta$ is the Kronecker
delta. Graphs based on the two different eye-states have been
constructed, considering different values for $\tau$; see Fig. 7.

Figure 7: Different graphs constructed from the 14 EEG channel
time-series, relative to eye-state and a value for $\tau$, the strength
of linear similarity. Panels include (e) eye-blink, $\tau = 0.7$ and
(f) eye-blink, $\tau = 0.8$.

The constructed graphs (small, given the number of channels) show a
strong dissimilarity in topological structure between the first set
(eye-open signals) and the second (constructed from eye-blink
signals). Studying metrics such as the average degree of nodes for
both types of graphs can be used to quantify the topological
similarity further.
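The graph construction described above (correlate channels, threshold the correlations, read off the average degree) can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the threshold is called `tau` here, and the Heaviside function is taken to return 1 at zero, a convention the text does not specify.

```python
from math import sqrt


def centre(x):
    """Remove the mean so the correlation formula for zero-mean series applies."""
    m = sum(x) / len(x)
    return [v - m for v in x]


def pearson(x, y):
    """Pearson correlation of two equally long series."""
    x, y = centre(x), centre(y)
    num = sum(a * b for a, b in zip(x, y))
    den = sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den


def adjacency(series, tau):
    """Link two channels when their correlation reaches the threshold tau.
    The diagonal stays 0, which is the effect of subtracting the Kronecker delta."""
    n = len(series)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if pearson(series[i], series[j]) >= tau:  # assumed: Theta(0) = 1
                adj[i][j] = adj[j][i] = 1
    return adj


def average_degree(adj):
    """Mean number of links per node."""
    return sum(map(sum, adj)) / len(adj)


# Toy example with three short "channels".
channels = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 11], [5, 1, 4, 2, 3]]
graph = adjacency(channels, tau=0.7)
```

Comparing `average_degree` for graphs built from eye-open and eye-blink segments at the same threshold gives one simple number for how much the two topologies differ.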

Based on the observed similarities (linear similarity explored here),
we argue that machine learning methods should provide comparable
results if, on the one hand, the feature space is reduced based on the
relevance of features and their 'score' of redundancy, and on the
other, the similarity between signals can be captured in shorter
time-series of signals. To test both assumptions we sliced the
provided time-series for all channels into time-windows of 3 seconds
(384 timepoints, collected at a rate of 128 Hz), each containing a
transition between eye-blink and eye-open or vice versa. Fig. 8 shows
a time-series window of 7 seconds for demonstration purposes.

Figure 8: Example of F7 signal time-series window; the window here is
of 7 seconds. Light blue represents recordings in the closed-eye state
and dark blue eyes-open.
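The slicing step can be sketched as follows. The text fixes the window length (3 seconds at 128 Hz, i.e. 384 timepoints) and requires each window to contain an eye-state transition; how windows were positioned is not stated, so centring each window on a transition and skipping overlapping windows are assumptions of this sketch, as are the function names.

```python
FS = 128        # sampling rate in Hz (from the text)
WINDOW_S = 3    # window length in seconds (from the text)


def transition_indices(labels):
    """Indices at which the eye-state label differs from the previous sample."""
    return [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]


def slice_windows(n_samples, transitions, width=FS * WINDOW_S, max_windows=20):
    """Return (start, end) index pairs of non-overlapping windows, each
    centred on a transition (assumption) and fully inside the recording."""
    windows, last_end = [], 0
    for t in transitions:
        start = max(t - width // 2, 0)
        end = start + width
        if start < last_end or end > n_samples:
            continue  # would overlap the previous window or run off the end
        windows.append((start, end))
        last_end = end
        if len(windows) == max_windows:
            break
    return windows


# Toy label sequence: open, a short closed period, then open again.
labels = [0] * 500 + [1] * 100 + [0] * 600
windows = slice_windows(len(labels), transition_indices(labels))
```

With 20 such windows of 384 timepoints each, the number of observations per channel is 20 x 384 = 7,680.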


20 time-series slices (windows of 3 s length each) were generated
for each channel, resulting in the total number of observations
being reduced to 7,680. We then implemented a filtering approach based
on mutual information ($M_{ij}$), given by

$$M_{ij} = \sum_{T_i, T_j} P_{ij}(T_i, T_j) \log \frac{P_{ij}(T_i, T_j)}{P_i(T_i)\, P_j(T_j)} \qquad (3)$$

Here $P_i(T_i)$ is the probability density function (PDF) of time series
$T_i$, and $P_{ij}(T_i, T_j)$ is the joint PDF for $(T_i, T_j)$.

The minimum redundancy maximum relevance (mRMRe) algorithm [4], as a
filtering method, uses differences of $M_{ij}$ to compute the degree of
dependency between multiple random variables. The method then
sequentially compares the relevancy/redundancy balance of information
between variables, providing scores for both their relevance and
redundancy. The variable (channel time-series) selected at each step
is the one with the highest score. A negative score indicates that the
final trade-off of information is dominated by redundancy, and a
positive score that it is dominated by relevancy. The scoring results,
averaged over the 20 time-series slices, are shown in Fig. 9.

Speed-up gains using signal epochs and mutual information measure

Figure 9: The variable selected at each step is the one with the
highest score (the variables are ordered in the plot). The scores at
selection can thus be read from the diagonal. (Scale labels:
Redundancy, Relevancy.)
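A mutual-information filter of this kind can be sketched in a few lines. This is not the mRMRe implementation cited in the text: it is a generic greedy "relevance minus mean redundancy" selection on discretised signals, and the equal-width binning, the function names and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from math import log


def discretise(signal, bins=8):
    """Map a real-valued series to integer bin indices (equal-width bins)."""
    lo, hi = min(signal), max(signal)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a flat signal
    return [min(int((v - lo) / width), bins - 1) for v in signal]


def mutual_information(a, b):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * log((c / n) / ((pa[u] / n) * (pb[v] / n)))
               for (u, v), c in pab.items())


def mrmr(features, target, n_select):
    """Greedy selection: at each step pick the feature with the highest
    relevance to the target minus mean redundancy with those already chosen."""
    relevance = {f: mutual_information(x, target) for f, x in features.items()}
    selected, remaining = [], list(features)
    while remaining and len(selected) < n_select:
        def score(f):
            if not selected:
                return relevance[f]
            redundancy = sum(mutual_information(features[f], features[s])
                             for s in selected) / len(selected)
            return relevance[f] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy example: B duplicates A, C is independent of A but weakly informative.
target = [0, 0, 0, 1, 1, 1, 1, 1]
features = {
    "A": [0, 0, 0, 0, 1, 1, 1, 1],
    "B": [0, 0, 0, 0, 1, 1, 1, 1],
    "C": [0, 0, 1, 1, 0, 0, 1, 1],
}
chosen = mrmr(features, target, n_select=2)
```

In the toy example the duplicate channel B receives a negative score once A is selected, so the less relevant but non-redundant channel C is preferred, which mirrors the sign convention described above.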


Accordingly, we ran a series of experiments on a 3.2 GHz Intel
Core i5 processor with 16 GB 1600 MHz DDR3 memory, which
included a base run of different machine learning methods for
classification: Nearest Neighbors (KNN), Logistic Regression, Support
Vector Machines classifier (SVC) and Random Forests (RF). The base
run included the tuning of K for nearest neighbors and, using grid
search methods, of the regularisation and kernel hyper-parameters of
SVC (C = 10, gamma = 0.001) with a radial basis function (rbf) kernel.
RF is run with bootstrapping enabled and the number of selected
features set to 'automatic', in order to decrease variance amongst
constructed member trees.

A K-fold cross-validation score of the F1 metric was obtained over the
entire dataset of 14977 observations (117 seconds at a sampling rate
of 128 Hz). Although the value of the F1 score is not the main concern
here, a comparison between the accuracy of classification methods can
be found in [15], and in [14] for the performance of deep learning
architectures. The resulting classification F1 score and processing
speed were then compared to those of 3 different experimental
settings, for KNN, Logistic Regression, SVC and RF, which included:


• (A) 9 features from the mRMRe analysis are selected from the
available 14. The entire dataset of 14977 observations is used to
learn the classifiers. Accuracy and performance results compared to
the base run are shown in Table 1 below:

Classifier | F1 Score gain | Speed-up gain
KNN        | -0.1          | not observed
LogReg     | -0.17         | 2.3x
SVC        | -0.3          | 1.9x
RF         | -0.63         | 3x

Table 1: Accuracy and processing time comparison between
base run and experiment A.

• (B) 9 features from the mRMRe analysis are selected from the
available 14. 7,680 observations (obtained by reducing the time-series
to 20 sliced windows containing an eye-state transition) are used
for learning. Accuracy and performance results compared to the
base run are shown in Table 2 below:

Classifier | F1 Score gain | Speed-up gain
KNN        | -0.4          | 4.3x
LogReg     | -0.6          | 2.5x
SVC        | -0.7          | 5.6x
RF         | -0.7          | 3x

Table 2: Accuracy and processing time comparison between
base run and experiment B.

• (C) 14 features (no channel selection) on the 7,680 observations
are used to learn a classifier. Accuracy and performance results
compared to the base run are shown in Table 3 below:

Classifier | F1 Score gain | Speed-up gain
KNN        | -0.2          | 2.1x
LogReg     | -0.36         | 2x
SVC        | -0.3          | 3x
RF         | -0.5          | 0.3x

Table 3: Accuracy and processing time comparison between
base run and experiment C.
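The two quantities reported in Tables 1 to 3 can be sketched as follows. The text does not define them formally, so this sketch assumes that the F1 score gain is the difference between an experiment's F1 score and the base run's, and that the speed-up gain is the ratio of base-run time to experiment time; the function names and the numbers in the usage lines are hypothetical.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 score (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_gain(f1_experiment, f1_base):
    """Assumed definition: negative values mean a score penalty."""
    return f1_experiment - f1_base


def speed_up(base_seconds, experiment_seconds):
    """Assumed definition: how many times faster the experiment ran."""
    return base_seconds / experiment_seconds


# Hypothetical usage with made-up predictions and timings.
example_f1 = f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
example_speed_up = speed_up(8.6, 2.0)
```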


The obtained results show some classification score penalty in most
runs of the presented methods, yet the speed-up gain is promising.
The accuracy score declines more when feature reduction and data
slicing are applied together; however, processing speed-up gains are
then maximised. Further tuning, namely of SVC, is expected to yield
better scores when this work's methods are applied, which we aim to
focus on in future developments. That said, the method introduced can
be useful for the analysis of higher-dimensional EEG/MEG signals,
which are typically characterised by both redundancy of information
in their signals and, most importantly, noise.


4 CONCLUSIONS AND FURTHER WORK 


This paper presented a brief review of the literature on the analysis
of electroencephalogram (EEG) signals and its applications.
Information obtained by Emotiv headsets on human subjects includes
signal time-series from different electrodes, which typically exhibit
variability in response to events designed for a study of interest.
The resolution of the collected data, as well as the quantity of
time-series which can be obtained by such devices, is increasingly
producing both opportunities to gain further insights into brain
functionality and a challenge on the analysis side, in terms of
accuracy and efficiency. In the presented work, we showed that
efficiency could be improved with some (arguably marginal) penalty on
the accuracy of a range of popular machine learning methods that are
applied for the analysis of EEG data. The introduced method assumes
that much of the EEG signal information can be captured by (A) signals
in a subset of EEG channels, which we filtered by applying the mRMRe
technique, and (B) signal information from discrete time-series slices
(3 s) which contain signal (eye-state) transitions. Experimental
results obtained show that both assumptions hold for the
classification of eye-state from 14 EEG channels based on the
dataset provided by Roesler, see Section 2.

When developing this work, results should be considered along- 
side deep neural network architectures which have shown low 
convergence times and may be applicable in real-time classification 
of eye-state [14]. 

The slicing frequency and the number of features/channels to select
were chosen heuristically here. Therefore, based on these preliminary
outcomes, we hope to validate the presented method and the obtained
outcomes on larger datasets in future work. We also believe that more
noticeable gains in learning efficiency should be possible for
datasets with significantly higher-dimensional feature spaces.


CODE AVAILABILITY 


Authors will provide link to code and dataset in camera-ready 
manuscript. 
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ABSTRACT 


Data publishing is a challenging task from the privacy point of view.
Different anonymization techniques are proposed in the literature to
preserve privacy in accordance with some mathematical constraints.
Disassociation is one of the anonymization techniques that relies on
the $k^m$-anonymity privacy constraint to guarantee a certain level of
privacy for set-valued datasets (e.g., search and shopping items).
Disassociation separates a set-valued dataset by clustering the
dataset into groups of records with common frequent items, and then
splitting each cluster into record chunks respecting $k^m$-anonymity.
In this paper, we define a new ant-based clustering algorithm based on
the disassociation technique to keep some of the items associated
together throughout the anonymization process. We define these
associations as utility rules that should be treated with particular
care while anonymizing the data. We perform a set of experiments to
evaluate our algorithm w.r.t. these utility rules.


CCS CONCEPTS 

• Security and privacy → Usability in security and privacy;
Data anonymization and sanitization; Privacy-preserving protocols;
• Computing methodologies → Artificial life.
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1 INTRODUCTION 


Data publishing has become a challenging task considering all the
disciplines that are involved in the process. In fact, the privacy of
the data is a major concern that increases with the complexity and the
size of the data. Many anonymization techniques presented in the
literature can be employed to protect the privacy of the users in the
data [4, 8, 15, 19, 21, 24, 25, 27]. However, anonymizing the data is a
burden on the utility. Set-valued data provides enormous opportunities
for various data analyses; for that reason, a trade-off between data
privacy and data utility must be found. The aim is not only to provide
a good anonymization of the data but also to make the output valuable
for future analysis. In this work, we focus on one technique,
disassociation, presented in [22], which divides a set-valued dataset
into clusters and then each cluster into record chunks preserving the
$k^m$-anonymity privacy constraint. Disassociation is studied in [2],
where a privacy breach defined as the cover problem was found. We
propose a solution for this cover problem in [1]. In this work, we
study the problem of publishing a predefined set of associations,
which we call utility rules, under the disassociation technique and
with rigorous attention to their utility. Actually, the horizontal
partitioning in disassociation groups data records in clusters using a
naive similarity function between the records. Under this perspective,
we propose a variation of the ant-based clustering methodology to
increase the utility of predefined associations.

Swarm intelligence is the domain of studying the social behaviors of
swarms like ant colonies and bird and fish schools. For more than
three decades, swarm intelligence has been flourishing and used
effectively in multiple fields to solve optimization problems, like
the traveling salesman problem using ant colony optimization
(ACO) [7], constructing portfolios of stock in the financial field
using particle swarm optimization (PSO) [28] and finding the best
position to hide information using cat swarm optimization (CSO) [26].
Through the observation of the collective behaviors of decentralized,
self-organized natural systems, it is fascinating to discover how,
with limited individual abilities, swarms working together can
accomplish complex tasks. Inspired by the behavior of real ants and
their pheromone-based communication, we present a variation of an
ant-based algorithm to cluster the data related to the utility rules
for the disassociation.

Table 1: Notations used in the paper

Symbol               | Meaning
$\mathcal{D}$        | a set of items
$\mathcal{T}$        | a table containing individuals' related records
$r$                  | a record (of $\mathcal{T}$), which is a set of items associated with a specific individual of a population
$\mathcal{T}^*$      | a table anonymized using the disassociation technique
$I$                  | an itemset included in $\mathcal{D}$
$s(I, \mathcal{T})$  | the number of records in $\mathcal{T}$ that are supersets of $I$
$C$                  | a cluster in a disassociated dataset, formed by the horizontal partitioning of $\mathcal{T}$
$C^*$                | a vertically partitioned cluster $C$ that results in record chunks and a term chunk
$R_C$                | a record chunk from the vertically partitioned cluster $C^*$
$T_C$                | the term chunk from the vertically partitioned cluster $C^*$
$\delta$             | maximum number of records allowed in a cluster, also known as the maximum cluster size

The rest of the paper is organized as follows. First, Section 2
describes the problem of data utility preservation under a specific
anonymization technique, disassociation, and evaluates it
theoretically. Then, Section 3 recalls some of the traditional and
swarm intelligence based data clustering methods with their advantages
and limitations. To solve our problem, we propose a variation of
an ant-based clustering technique, described in Section 4. Finally,
Section 5 investigates the efficiency of our solution and its impact
on the utility of aggregate analysis for a predefined set of
association rules. We conclude the article in Section 6 and present
outlines for future work.


2 PROBLEM DEFINITION AND 
BACKGROUND 


2.1 Problem definition 


The disassociation technique, as proposed in [22], is driven by the
idea of ensuring the $k^m$-anonymity privacy constraint by separating
the terms of a record into multiple record chunks within a cluster.
This process creates ambiguity for an association between its
separated terms, which causes a reduction of the utility for the
association in question. Disassociation works on the assumption that
there are no specific associations more valued than others and that
data items must not be altered, generalized or suppressed. In this
paper, our aim is to provide a better utility for a set of predefined
associations, called the utility rules, by reducing the number of
split-ups a utility rule has to endure in order to preserve
$k^m$-anonymity. Formally, $k^m$-anonymity is defined as follows:


Definition 2.1 ($k^m$-anonymity). Given a dataset of records
$\mathcal{T}$ whose items belong to a given set of items $\mathcal{D}$,
the dataset $\mathcal{T}$ is $k^m$-anonymous if $\forall I \subseteq
\mathcal{D}$ such that $|I| \le m$, the number of records in
$\mathcal{T}$ that are supersets of $I$ is greater than or equal to
$k$, i.e., $s(I, \mathcal{T}) \ge k$.
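A brute-force check of Definition 2.1 can be sketched as follows. This is an illustration, not the procedure used in [22]: the function names are hypothetical, and the check is restricted to itemsets that occur in at least one record, which is an assumption of this sketch (the definition as stated quantifies over every itemset of up to m items).

```python
from itertools import combinations


def support(itemset, records):
    """Number of records that are supersets of the itemset."""
    wanted = set(itemset)
    return sum(1 for r in records if wanted <= set(r))


def is_km_anonymous(records, k, m):
    """True if every combination of up to m items that occurs in some
    record is contained in at least k records (assumed reading)."""
    checked = set()
    for record in records:
        items = sorted(set(record))
        for size in range(1, m + 1):
            for combo in combinations(items, size):
                if combo in checked:
                    continue
                checked.add(combo)
                if support(combo, records) < k:
                    return False
    return True


# Toy datasets: item "c" appears only once in the first one.
leaky = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}]
safe = [{"a", "b"}, {"a", "b"}, {"a", "b"}]
```

The cost grows with the number of item combinations per record, which is why disassociation enforces the constraint chunk by chunk rather than on whole records.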


In what follows, we review the disassociation technique under the 
perspective of the utility of associations. Table 1 recalls the basic 
notations used in this paper. 
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2.2 Disassociation and utility awareness 


In this work, we are interested in one anonymization technique,
disassociation, which relies on $k^m$-anonymity to guarantee data
privacy. We dedicate this section to showing how this technique is a
two-sided coin for publishing data:

• the first side of the coin is concerned with the data utility and
relies on the clustering of records,

• the second side of the coin is involved in attaining a privacy level
through the split-up of terms.

We use Fig-1 to illustrate an example of disassociation, applied with
$k = 3$, $m = 2$ and $\delta = 6$.
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Figure 1: Example of disassociation 
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Figure 2: Example of a utility driven disassociation 


Horizontal partitioning, as presented in [22], clusters the records
using a naive similarity function. Records are grouped together in
clusters of maximum size equal to $\delta$, based on a common frequent
term. The authors of [22] justify the use of the naive method by
two main complaints regarding similarity functions: first, they
are inefficient on large datasets, and second, they do not explicitly
control the size of the clusters. However, this process of horizontal
partitioning, based on the support of items, does not take into
account the associations in the data.

Horizontal partitioning, as shown in Fig-1(b), groups the records that
contain the most frequent item, $a$, having $s(a, \mathcal{T}) = 6$,
within cluster $C_1$, and all the other records within $C_2$. Both
clusters have a size less than $\delta$, so $C_1$ and $C_2$ are then
vertically partitioned.
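The naive horizontal partitioning described above can be sketched as a recursive split on the most frequent item. This follows the description in the text rather than the exact algorithm of [22]: the function name, the alphabetical tie-breaking and the stopping rule (a cluster is kept once it holds at most `max_size` records) are assumptions of this sketch.

```python
from collections import Counter


def horizontal_partition(records, max_size, ignore=frozenset()):
    """Recursively split records on the most frequent item not yet used,
    until every cluster holds at most max_size records."""
    if len(records) <= max_size:
        return [records]
    counts = Counter(item for r in records for item in set(r)
                     if item not in ignore)
    if not counts:
        return [records]  # nothing left to split on
    # Most frequent item; ties are broken alphabetically (assumption).
    top, _ = max(sorted(counts.items()), key=lambda kv: kv[1])
    with_top = [r for r in records if top in r]
    without_top = [r for r in records if top not in r]
    clusters = horizontal_partition(with_top, max_size, ignore | {top})
    if without_top:
        clusters += horizontal_partition(without_top, max_size, ignore)
    return clusters


# Toy dataset: item "a" appears in 6 of the 10 records.
dataset = [{"a", "b"}, {"a", "c"}, {"a"}, {"a", "b"}, {"a", "d"}, {"a", "c"},
           {"b", "c"}, {"c"}, {"b"}, {"d"}]
clusters = horizontal_partition(dataset, max_size=6)
```

Note that the split only looks at item frequency, so records sharing an association such as (b, c) can end up in different clusters; this is the limitation the paper sets out to address.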


Vertical partitioning is the process of creating, for each cluster,
record chunks that verify the $k^m$-anonymity privacy constraint, and
a term chunk that contains the items having a support less than $k$ in
the cluster.
In our example, vertical partitioning is applied over each cluster
returned by the horizontal partitioning, $C_1$ and $C_2$, splitting the
items into different record chunks with respect to the $k^m$-anonymity
privacy constraint. The final result of disassociation is shown in
Fig-1(c).
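The first step of vertical partitioning, deciding which items go to the term chunk, can be sketched as below. Only this step is shown; the construction of record chunks that satisfy the privacy constraint is more involved and is omitted. The function name and toy cluster are hypothetical.

```python
from collections import Counter


def term_chunk(cluster, k):
    """Items whose support inside the cluster is below k."""
    counts = Counter(item for record in cluster for item in set(record))
    return {item for item, count in counts.items() if count < k}


# Toy cluster: only "a" reaches a support of 3.
cluster = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"a"}]
rare_items = term_chunk(cluster, k=3)
```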


To illustrate the problem of utility, let us consider that the
frequency of the association (b, c) is important in the analysis after
data publishing. Unfortunately, we can barely extract valuable
information about the association between items b and c from
Fig-1(c). In $C_1^*$, both items are added to the term chunk because
their support is less than $k = 3$, showing neither the association
between them nor their real support. Similarly, the association (b, c)
is unclear in $C_2^*$, with only one advantage over $C_1^*$: knowing
the support of item b.

Let us suppose that there exists a clustering technique that favors
the association (b, c) while disassociating $\mathcal{T}$, and brings
together all the records related to it as in Fig-2(a). After applying
vertical partitioning in Fig-2(b), the association (b, c) is fully
preserved. Now, any analysis over the support of (b, c) is accurate.
Hence, data utility depends essentially on horizontal partitioning.
From this example, we deduce that it is crucial to give a data analyst
the ability to define a set of associations, which we call utility
rules, that are important in future analysis. Those utility rules must
be preserved carefully within the anonymized data.


2.3 Utility rules 


Giving an exact general definition for the utility of the data in the
domain of anonymization is not realistic. Generally, the utility is
the quality of data for the intended use, and it is defined in
different ways in the literature. Some works consider the utility as a
practical guide to reduce the extent of data generalization [11],
whereas in [9] a clustering based technique is implemented to minimize
the abstraction. In this work, data utility is considered in terms of
the accuracy of aggregate query answering. Frequent or not, we assume
that any utility rule should be well represented after disassociation.
Let us first formalize our utility rules and their context:

• Let $\mathcal{T} = \{r_1, ..., r_n\}$ be the original dataset of
records. Every record $r_i \in \mathcal{T}$ is a set of items
$r_i = \{y_1, ..., y_p\}$ related to a specific individual.

• Let $UR = \{ur_1, ..., ur_u\}$ be a dataset of predefined
associations, where each utility rule $ur_j = \{x_1, ..., x_q\}$ is a
set of items from the same domain as $\mathcal{T}$. In this work we
consider that every utility rule exists at least once in the original
dataset: $\forall ur_j \in UR$, $\exists r \in \mathcal{T}$ such that
$ur_j \subseteq r$.

• Let $s(ur, \mathcal{T})$ be the support of the utility rule $ur$ from
$UR$ in the original dataset, which is the number of records from
$\mathcal{T}$ containing $ur$.

• Let $s(ur, \mathcal{T}^*)$ be the support of the utility rule in a
disassociated dataset.

• Let $conf(ur, \mathcal{T})$ be the confidence of the utility rule
$ur$ in the original dataset $\mathcal{T}$, defined as:

$$conf(ur, \mathcal{T}) = \frac{s(ur, \mathcal{T})}{|\mathcal{T}|}$$

• Let $conf(ur, \mathcal{T}^*)$ be the confidence of the utility rule
$ur$ in the disassociated dataset $\mathcal{T}^*$, defined as:

$$conf(ur, \mathcal{T}^*) = \frac{s(ur, \mathcal{T}^*)}{|\mathcal{T}^*|}$$

We know that $\mathcal{T}^*$ can have a maximum size equal to that of
the original dataset, $\mathcal{T}$, and for the following analysis we
consider:

$$conf(ur, \mathcal{T}^*) = \frac{s(ur, \mathcal{T}^*)}{|\mathcal{T}|}$$

Definition 2.2 ($\alpha$-confidence). We say that a utility rule $ur$
is $\alpha$-confident with:

$$\alpha = \frac{conf(ur, \mathcal{T}^*)}{conf(ur, \mathcal{T})}$$

We use the term confidence to determine the strength of the
association between the items of a utility rule $ur$ after
disassociation. Statistical queries are based on the support of the
associations in question. The $\alpha$-confidence represents the
percentage of the original support of a utility rule that must be
reflected in the final output of the disassociation for the utility
rule.
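The quantities of this section can be sketched as follows. The support in the disassociated dataset is passed in as a number, since computing it requires the reconstruction model introduced later; the function names and toy data are hypothetical, and both confidences are assumed to use the size of the original dataset as denominator, as the text states for its analysis.

```python
def support(rule, records):
    """Number of records containing every item of the rule."""
    wanted = set(rule)
    return sum(1 for r in records if wanted <= set(r))


def confidence(rule, records):
    """Support of the rule divided by the number of records."""
    return support(rule, records) / len(records)


def alpha_confidence(disassociated_support, rule, records):
    """Ratio of the confidence after disassociation to the original one.
    With a common denominator this reduces to a ratio of supports."""
    conf_after = disassociated_support / len(records)
    return conf_after / confidence(rule, records)


# Toy dataset in which the rule (b, c) appears in 3 of 5 records.
original = [{"a", "b", "c"}, {"b", "c"}, {"a"}, {"b", "c", "d"}, {"a", "d"}]
rule = {"b", "c"}
alpha = alpha_confidence(1.5, rule, original)
```

Here a disassociated support of 1.5 (a hypothetical value) against an original support of 3 gives a ratio of 0.5, i.e. half of the original support is reflected in the output.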

In what follows, we theoretically evaluate the utility of an
association under $k^m$-anonymity.


2.4 The utility privacy trade-off in disassociation 


The privacy constraint of disassociation forces any output of the data
to be $k^m$-anonymous in every record chunk, and no cluster can be
larger than $\delta$. In what follows, we evaluate $\alpha$-confidence
under the $k^m$-anonymous privacy context of disassociation.

We formalize our context for the following analysis:

• Let $R_{i_C}$ be a record chunk from the vertically partitioned
cluster $C^*$.

• Let $X = \arg\min_{I \subseteq ur,\, I \in R_{i_C}} s(I, R_{i_C})$ be
the subset of the utility rule $ur$, found in a record chunk $R_{i_C}$
from cluster $C^*$, having the minimum support among all subsets in the
different record chunks.

• Let $Y = \{y \mid y \subseteq ur,\, y \in R_{i_C}\} \setminus X$ be
the set of the subsets of the utility rule $ur$, except $X$, found in
the record chunks of the cluster $C^*$.

• Let $Z = \{e \mid e \in T_C,\, e \in ur\}$ be the set of items of the
utility rule $ur$ that belong to the term chunk $T_C$ of the cluster
$C^*$.


Definition 2.3. We assume that the support of a utility rule,
$ur = X \cup Y \cup Z$, when vertically partitioned in $C^*$, is the
average support of reconstructing it, $avgs(ur, C^*)$, calculated as
the product of the minimum possible support by the frequencies of all
the other subsets of $ur$ in the record chunks:

$$avgs(ur, C^*) = \gamma \times \prod_{j=1}^{p} fr(y, R_{j_C}),$$

where:

• $fr(y, R_{j_C})$ is the frequency of any subset of $ur$, $y \in Y$,
present in a record chunk of $C^*$, defined as:

$$fr(y, R_{j_C}) = \frac{s(y, R_{j_C})}{|R_{j_C}|}$$

• $\gamma$ represents the maximum possible original support of $ur$ in
$C^*$. In fact, a utility rule cannot appear more often than any of its
subsets in the record chunks, thus $\gamma$ is the support of the
minimum subset of $ur$ in $C^*$:

$$\gamma = \begin{cases} s(X, C^*) & \text{if } Z = \emptyset \\ \min(2^{|T_C| - |Z|},\, k - 1) & \text{if } Z \neq \emptyset \end{cases}$$

with $2^{|T_C| - |Z|}$ representing the number of possibilities for
reconstructing $Z$ in the term chunk $T_C$.

• $p$ is the number of record chunks that contain subsets of $ur$.


Definition 2.4. The support of $ur$ in the whole disassociated
dataset, $\mathcal{T}^*$, is the sum of the average support of $ur$ in
every vertically partitioned cluster:

$$s(ur, \mathcal{T}^*) = \sum_{i} avgs(ur, C_i^*),$$

where $i$ ranges over the vertically partitioned clusters of
$\mathcal{T}^*$.
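Definitions 2.3 and 2.4 can be sketched as follows. This is an illustrative reading of the formulas, with hypothetical function names: each record chunk that holds a further subset of the rule is summarised by a pair (support of the subset in the chunk, number of records in the chunk), and the inputs in the usage lines are made up.

```python
from math import prod


def gamma(support_x, term_chunk_size, z_size, k):
    """Maximum possible original support of the rule in a cluster."""
    if z_size == 0:
        return support_x
    return min(2 ** (term_chunk_size - z_size), k - 1)


def avg_support(gamma_value, chunk_stats):
    """gamma times the product of the subset frequencies, where chunk_stats
    is a list of (support in chunk, chunk size) pairs."""
    return gamma_value * prod(s / size for s, size in chunk_stats)


def disassociated_support(per_cluster_avg_supports):
    """Sum of the average supports over all vertically partitioned clusters."""
    return sum(per_cluster_avg_supports)


# Hypothetical cluster: X has support 3, one other subset appears in
# 3 of the 6 records of its chunk, and no rule item is in the term chunk.
g = gamma(support_x=3, term_chunk_size=0, z_size=0, k=3)
cluster_avg = avg_support(g, [(3, 6)])
total = disassociated_support([cluster_avg, 0.5])
```

Each additional record chunk over which the rule is split multiplies the estimate by a frequency of at most 1, which is why fewer split-ups preserve more of the original support.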

LEMMA. $\forall ur \in UR$, a $k^m$-anonymous disassociation ensures
$\alpha$-confidence for $ur$, where:

$$\frac{1}{\delta^{|ur|}} \le \alpha \le 1$$


PROOF 1.
$k^m$-anonymity ensures that any subset in a record chunk is present
at least $k$ times when its cardinality is less than or equal to $m$.
If its cardinality is greater than $m$, it may be present one time in
a record chunk and still verify the $k^m$-anonymity constraint.
Furthermore, a record chunk contains at most $\delta$ records due to
the maximum cluster size constraint. Then:

$$k \le s(X, R_{i_C}) \le \delta, \quad \text{if } |X| \le m$$
$$1 \le s(X, R_{i_C}) \le \delta, \quad \text{if } |X| > m$$

With the same reasoning, we know that the frequency of $y$ in
$R_{j_C}$ is bounded by:

$$\frac{k}{\delta} \le fr(y, R_{j_C}) \le 1, \quad \text{if } |y| \le m$$
$$\frac{1}{\delta} \le fr(y, R_{j_C}) \le 1, \quad \text{if } |y| > m$$

A utility rule can be retrieved in at most $|ur|$ record chunks, when
every item from $ur$ is present in a distinct record chunk after
vertical partitioning; then $\forall y \in Y$:

$$\left(\frac{1}{\delta}\right)^{|ur| - 1} \le \prod_{j=1}^{|ur|} fr(y, R_{j_C}) \le 1$$

When $Z$ is reconstructed from the term chunk, $T_C$, it cannot have a
support greater than $k - 1$, hence:

$$1 \le \min(2^{|T_C| - |Z|},\, k - 1) \le k - 1$$

To generalize, we consider that the least frequent subset of $ur$ is
present in a record chunk and not in the term chunk, with
$Z = \emptyset$; then $\gamma$ of Definition 2.3 is bounded by:

$$1 \le \gamma \le \delta$$

We can deduce from Definition 2.3 that:

$$\frac{1}{\delta^{|ur| - 1}} \le avgs(ur, C^*) \le \delta \qquad (1)$$

From Definition 2.4, we calculate the support of $ur$ in
$\mathcal{T}^*$ by adding the average support over the clusters that
represent $ur$. In fact, we can reconstruct $ur$ in at least
$\frac{s(ur, \mathcal{T})}{\delta}$ clusters. Therefore, we multiply
the inequality (1) by this minimum number of clusters representing
$ur$:

$$\frac{s(ur, \mathcal{T})}{\delta^{|ur|}} \le s(ur, \mathcal{T}^*) \le s(ur, \mathcal{T})$$

Following Definition 2.2, we calculate the $\alpha$-confidence of $ur$
as:

$$\alpha = \frac{conf(ur, \mathcal{T}^*)}{conf(ur, \mathcal{T})} = \frac{s(ur, \mathcal{T}^*)}{|\mathcal{T}|} \times \frac{|\mathcal{T}|}{s(ur, \mathcal{T})} = \frac{s(ur, \mathcal{T}^*)}{s(ur, \mathcal{T})}$$

therefore:

$$\frac{1}{\delta^{|ur|}} \le \alpha \le 1$$


From this analysis, we can see that to ensure $k^m$-anonymity, the
confidence of the utility rule can have a very low value. In this
case, disassociation provides privacy by putting the utility of data
analysis at risk.
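To give a feel for how loose the lower bound of the lemma is, the following worked example plugs in illustrative values; the maximum cluster size of 6, the rule of two items and the original support of 72 are assumptions chosen only for the arithmetic.

```latex
% Illustrative values (assumed): maximum cluster size 6, rule of 2 items.
\[
  \frac{1}{\delta^{|ur|}} = \frac{1}{6^{2}} = \frac{1}{36} \approx 0.028
  \quad\Longrightarrow\quad 0.028 \le \alpha \le 1 .
\]
% For a rule with an original support of 72 records, the guaranteed
% support after disassociation is therefore only
\[
  \alpha \cdot s(ur, \mathcal{T}) \ge \frac{72}{36} = 2 .
\]
```

Under these values, only about 3% of the original support is guaranteed in the worst case, and the guarantee shrinks by a further factor of 6 for every additional item in the rule.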


The m-α relationship:

It is easy to assume from the above analysis that $\alpha$-confidence is
independent of the m parameter of the $k^m$-anonymity constraint.
However, this is not true, due to the direct link between m and vertical
disassociation. In fact, for $k^m$-anonymity to be achieved, every
association of up to m items should be present at least k times in
a record chunk; if an association cannot be present k times, it
must be partitioned over multiple record chunks. Considering all
the m-item combinations formed from the items of the utility rule ur that
will be tested, $\binom{|ur|}{m}$, we can understand how hard it is for a
utility rule on its own to withstand the $k^m$-anonymity test and
be preserved unpartitioned in one record chunk. In practice, this
number of tests, $\binom{|ur|}{m}$, is larger because of other items that belong to the
record chunk but not to the utility rule. Hence, to be more resilient
to $k^m$-anonymity, a utility rule should not have a very high
cardinality, since the number of tests to pass grows with the
number of items forming the utility rule.
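The growth in the number of tests can be sketched as follows. This is a minimal illustration that counts every itemset of size 1 up to m that can be formed from the items of a single utility rule, assuming that each such itemset must pass the k-occurrence check. The helper name `association_tests` is hypothetical.

```python
from math import comb


def association_tests(rule_size: int, m: int) -> int:
    """Count the itemsets of size 1..m that can be formed from a rule's items."""
    largest = min(m, rule_size)  # no itemset can be larger than the rule itself
    return sum(comb(rule_size, size) for size in range(1, largest + 1))


# Hypothetical rule sizes, all checked with m = 3.
for size in (3, 6, 9):
    print(size, association_tests(size, 3))
```

The count ignores the other items of the record chunk, so it is only a lower estimate of the checks a real chunk has to pass.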

We could reflect the m constraint in Proposition-2.4. Since
$|ur| \le \max(|ur|, m)$, then:


\[ \frac{1}{\delta^{\max(|ur|, m)}} \le \frac{1}{\delta^{|ur|}} \le \alpha \le 1 \]
Since the above proof reaches a more precise lower bound for $\alpha$-confidence,
we keep that result in Proposition-2.4.


From this perspective of privacy-utility trade-off, we are motivated 
to contribute with a more insightful horizontal partitioning process, 
tolerant to the predefined utility rules. Our aim is to disassociate the 
data, while taking into consideration the utility rules, for future data 
analysis accuracy. As noticed from the example in Fig-2(b), vertical 




partitioning is a result of the quality of clustering and is the main
privacy pillar of the data. In this paper, we apply the same process
of vertical partitioning proposed in [22] and limit our work to
horizontal partitioning. The next section provides a general description
of the clustering problem and of the role of swarm intelligence
algorithms in improving data clustering.


3 DATA CLUSTERING 


Classical clustering: 

Clustering is by definition the task of grouping a set of data with
similar characteristics together, where data within a cluster are more
similar to each other than to those in other clusters. There exists
no unified solution to all clustering problems, and clustering has
been proven to be NP-hard [23]. We distinguish in the literature
two different modes of clustering: fuzzy and partitional. In fuzzy
clustering, data items may belong to multiple clusters with a fuzzy
membership grade, as in c-means [3]. This property does
not hold for partitional clustering, where clusters are totally disjoint,
as in the widely used K-means algorithm [16]. In this work, and in
accordance with the disassociation principle, we are only interested
in partitional clustering.


Swarm Intelligence Clustering: 

Originally, Swarm Intelligence (SI) algorithms were adapted to stochastic
search and optimization problems. They do not focus on the
strict modeling of natural processes, but use their best ideas to
improve the convergence and accuracy of the solutions. Examples
of SI systems are: ant colony system (ACS) [7, 20], particle swarm
optimization (PSO) [13] and artificial bee colony (ABC) [12]. These
algorithms draw inspiration from the collective behavior of decentralized,
self-organized social animals. Even though the particles
of a swarm may have very limited individual capabilities, they can
perform very complex jobs, vital for their survival, when acting as a
community. Diverse jobs such as searching for and storing food, cleaning
corpses and building nests illustrate the complexity
of the jobs performed by the colonies, accurately and
without any kind of supervision. Choosing the right algorithm to
solve a problem relies on how well the given problem's
background matches the swarm's features.


Ant-based Clustering Algorithm (ACA): 

Ant-like agents have been applied to solve problems in the context
of object clustering. The ant clustering algorithm (ACA) is
modeled after the social behavior of ants sorting larvae and cleaning
corpses. It was first modeled by [5] to solve robotics tasks. Two
major features influence the action of the ants for picking and dropping
items: the similarity and the density of the data within the local
neighborhood. From this first model, researchers introduced many
variations of the algorithm, applicable in a wider clustering domain to
solve different problems [6, 10, 14, 17, 18].


In the next section, we define a variant of the ant-based clustering 
algorithm to redefine the horizontal partitioning of the disassociation 
for the predefined utility rules, achieving a higher utility. 
Table 2: Ant Colony Terminology

Ant colony | Set of utility rules UR.
Ant | Expert agent $a_i$ working for the benefit of the utility rule $ur_i$.
Pheromone trail | Square matrix, A, representing the density of each utility rule in each cluster, updated through probabilistic picking-up and dropping functions. It is the sharable memory between the ants.
Food | Data records $\mathcal{T}_{UR}$ from the dataset $\mathcal{T}$, relative to the utility rules such that: $\mathcal{T}_{UR} = \{r \in \mathcal{T} \mid \exists ur \in UR \text{ and } ur \subseteq r\}$
Ant's load | $load(a_i)$ contains a data record from $\mathcal{T}_{UR}$ that the ant $a_i$ is transporting.
Individual ant's job | Picking-up and dropping a load.


4 UTILITY DRIVEN ANT-BASED 
CLUSTERING (UDAC) 


4.1 Framework of the algorithm 


In this work, our motivation is to show how a clustering optimization
solution can increase the utility value of the predefined set of utility
rules in a disassociated dataset. We transform horizontal partitioning
for the records related to the utility rules into a clustering optimization
problem. The proposed algorithm takes advantage of the
widely explored natural ant behaviors. Table 2 describes the environment
of our clustering problem in the ant colony system terminology.
Our problem is challenging due to the fact that:


• A record might enclose multiple utility rules and, since we are
working in partitional clustering, this record should belong to
exactly one cluster satisfying one utility rule.

• The intersection of terms between the records can affect the
distance metrics.

• The maximum cluster size constant, $\delta$, limits the number of
records allowed in a cluster.


Let $\mathcal{T}_{UR}$ be the set of records from $\mathcal{T}$ concerned with the utility
rules UR such that:

\[ \mathcal{T}_{UR} = \{r \in \mathcal{T} \mid \exists ur \in UR \text{ and } ur \subseteq r\} \]

Let u be the number of utility rules in question:

\[ u = |UR| \]
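The selection of the records concerned with the utility rules can be sketched in a few lines of Python. This is a minimal illustration assuming that records and utility rules are both modelled as sets of items; the names `records_for_rules`, `dataset` and `rules`, as well as the toy data, are hypothetical.

```python
from typing import FrozenSet, List, Sequence

Record = FrozenSet[str]


def records_for_rules(dataset: Sequence[Record], rules: Sequence[Record]) -> List[Record]:
    """Return the records that are supersets of at least one utility rule."""
    return [r for r in dataset if any(ur <= r for ur in rules)]


# Toy data (assumption): four records and two utility rules.
dataset = [frozenset("abc"), frozenset("bd"), frozenset("cde"), frozenset("a")]
rules = [frozenset("ab"), frozenset("de")]

t_ur = records_for_rules(dataset, rules)  # records treated through special clustering
u = len(rules)                            # number of utility rules
```

The remaining records, those in `dataset` but not in `t_ur`, would go through the normal horizontal partitioning.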


Clusters’ initialization: 

In this context, only the set of records related to the utility rules,
denoted by $\mathcal{T}_{UR}$, are treated through special clustering. The rest of
the records from the dataset $\mathcal{T}$, which aren't supersets of any utility
rule, $\mathcal{T} \setminus \mathcal{T}_{UR}$, are clustered via normal horizontal partitioning. We
consider that every utility rule shall be represented in its own cluster
and have one ant as its agent. We initialize our algorithm with u
clusters and u ants, which are the expert agents that will transport the
loads from and into the clusters. The algorithm starts by sending its
expert ants in search of records containing their representative utility
rules from $\mathcal{T}_{UR}$, recursively until $\mathcal{T}_{UR}$ is empty, thus distributing
the $|\mathcal{T}_{UR}|$ records among the u clusters. A square matrix A of size u × u
is defined, representing the pheromone trail, which is the collective
adaptive memory of the expert ants, and is initialized with the value of
the support of each utility rule $ur_i$ in each cluster $C_j$, such that:

\[ A[i][j] = s(ur_i, C_j) \]

We denote by $\beta_{ur_i}$ the ratio of the records representing $ur_i$ in cluster
$C_i$:

\[ \beta_{ur_i} = \frac{A[i][i]}{s(ur_i, \mathcal{T})} \]
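A minimal Python sketch of the pheromone matrix and of the ratio computed from its diagonal is given below. It assumes that the support of a rule in a set of records is the number of records containing the rule; the helper names `support`, `pheromone_matrix` and `beta`, and the toy clusters, are illustrative assumptions.

```python
from typing import FrozenSet, List, Sequence

Record = FrozenSet[str]


def support(rule: Record, records: Sequence[Record]) -> int:
    """Number of records that contain every item of the rule."""
    return sum(1 for r in records if rule <= r)


def pheromone_matrix(rules: Sequence[Record],
                     clusters: Sequence[Sequence[Record]]) -> List[List[int]]:
    """A[i][j] is the support of rule i inside cluster j."""
    return [[support(ur, cluster) for cluster in clusters] for ur in rules]


def beta(A: List[List[int]], i: int, support_in_dataset: int) -> float:
    """Share of the original support of rule i that sits in its own cluster."""
    return A[i][i] / support_in_dataset


# Toy data (assumption): two rules, two clusters of two records each.
rules = [frozenset("ab"), frozenset("de")]
clusters = [
    [frozenset("abc"), frozenset("abde")],
    [frozenset("cde"), frozenset("abd")],
]
A = pheromone_matrix(rules, clusters)
all_records = [r for cluster in clusters for r in cluster]
betas = [beta(A, i, support(rules[i], all_records)) for i in range(len(rules))]
```

Here the original support is recomputed from the union of the clusters, which only holds for this toy setting where the clusters cover the whole dataset.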
Expert Ants' Job:
During the clustering process, every ant $a_i$ works for its utility rule
$ur_i$ to reach a predefined ratio $\beta_{predefined} \in [0, 1]$ of the original
support of the utility rule $ur_i$ within cluster $C_i$:

\[ \beta_{ur_i} \ge \beta_{predefined} \]

Originally, an ant working in a clustering problem, as defined by [5], 
picks-up and drops a data object following two probabilistic formu- 
las. The formulas help the ant decide if a data object is dissimilar to 
its neighborhood and if so, to which cluster it must be transported. In 
accordance with the literature, our expert ants move loads between 
the clusters according to the two basic principles, picking-up and 
dropping loads, adapted to our context: 


• For the Pick-Up job: the expert ant $a_i$ is responsible for choosing a cluster
$C_j$ and a record r from it, $r \in C_j$, to transport it into the cluster
$C_i$ representing the utility rule $ur_i$. This pick-up job is controlled by
the density of $ur_i$ in the clusters, defined as follows:

\[ d(ur_i, C_j) = \frac{A[i][j]}{|C_j|} \]

The expert ant chooses the cluster, $C_j$, with the highest density of
utility rule $ur_i$:

\[ C_j(ur_i) = \underset{\forall j \in [1, \ldots, u] - i}{\arg\max}\; d(ur_i, C_j) \]

Then, it transports a record r from $C_j$ to $C_i$ such that the record
r embeds $ur_i$, $ur_i \subseteq r$:

\[ load(a_i, ur_i) = r \text{ such that } r \in C_j(ur_i) \text{ and } ur_i \subseteq r \]

• For the Drop job: the expert ant $a_i$ searches for another expert ant,
$a_j$, that needs the most help to increase its $\beta$. We consider that ant
$a_j$ needs the most help when reaching $\beta_{predefined}$ for its utility rule
$ur_j$ demands the highest number of iterations:

\[ ur_j = \underset{\forall j \in [1, \ldots, u] - i}{\arg\max}\; \bigl(s(ur_j, \mathcal{T}) \cdot (\beta_{predefined} - \beta_{ur_j})\bigr) \]

Then, expert ant $a_i$ picks up a record r for $ur_j$ to add it to its representative
cluster $C_j$ following the Pick-Up job scenario described
above, helping the ant $a_j$.
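The two selection rules can be sketched as follows. This is a minimal illustration, assuming the pheromone matrix is stored as a list of lists and that clusters are identified by their index; the function names `density`, `pickup_source` and `drop_target` are hypothetical, and both selectors assume at least two utility rules.

```python
from typing import List, Sequence


def density(A: List[List[int]], cluster_sizes: Sequence[int], i: int, j: int) -> float:
    """Density of rule i in cluster j: A[i][j] / |C_j| (0 for an empty cluster)."""
    if cluster_sizes[j] == 0:
        return 0.0
    return A[i][j] / cluster_sizes[j]


def pickup_source(A: List[List[int]], cluster_sizes: Sequence[int], i: int) -> int:
    """Index of the cluster, other than C_i, with the highest density of rule i."""
    candidates = [j for j in range(len(cluster_sizes)) if j != i]
    return max(candidates, key=lambda j: density(A, cluster_sizes, i, j))


def drop_target(supports: Sequence[int], betas: Sequence[float],
                beta_predefined: float, i: int) -> int:
    """Index of the rule, other than rule i, that is furthest from its target."""
    candidates = [j for j in range(len(supports)) if j != i]
    return max(candidates, key=lambda j: supports[j] * (beta_predefined - betas[j]))


# Toy values (assumption): three rules and three clusters.
A = [[3, 1, 4], [0, 5, 1], [2, 2, 2]]
sizes = [10, 5, 8]
source = pickup_source(A, sizes, 0)
target = drop_target([10, 20, 30], [0.9, 0.5, 0.7], 0.8, 0)
```

Ties are broken by taking the first maximal index, a choice the text does not specify.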


Individual and cooperative work of expert ants: 

One of the most remarkable traits of any swarm intelligence system is
that the cooperation between the particles of the system leads to the
optimized solution despite the limited capabilities of each particle
on its own. As discussed before, we recognize this trait clearly in the ant
colony when it searches for the shortest path leading to food, or when
collecting and grouping corpses.


• If $\beta_{ur_i} \ge \beta_{predefined}$:
To speed up the convergence of the solution and prevent the ants
from moving aimlessly, expert ant $a_i$ is now free to help another ant.
In this case and in the current iteration, ant $a_i$ stops searching in the
space for a record satisfying its utility rule, $ur_i$. Ant $a_i$ searches for
the most vulnerable utility rule according to the Drop job scenario,
and picks a corresponding record to transport it to the vulnerable
cluster, following the Pick-Up job scenario.


• If $\beta_{ur_i} < \beta_{predefined}$:

In this case ant $a_i$ is responsible for and fully dedicated to its specific
utility rule, $ur_i$. Ant $a_i$ transports a relevant record to $C_i$, according
to the Pick-Up job scenario, to increase $\beta_{ur_i}$. At each iteration, $\beta_{ur}$
is rechecked for every utility rule ur. In fact, even if at a certain
iteration $\beta_{ur} \ge \beta_{predefined}$, this inequality might not hold at the next
iteration, due to the exchange of loads that happened during the last
iteration.


This cooperation between ants is possible thanks to the collective 
memory stored in the pheromone trail matrix A. The clustering pro- 
cess is iterative until attaining a predefined number of iterations or a 
level of stability. In what follows we present the algorithm behind 
the ant based clustering methodology and explain it thoroughly. 


4.2 Our algorithm 


In this section, we present our algorithm, the Utility-Driven Ant-based
Clustering (UDAC), that applies the process described above.
The algorithm starts by defining the set of records $\mathcal{T}_{UR}$, from the
original dataset $\mathcal{T}$, representing the utility rules (line 1). It continues
by creating u clusters, u expert ants and a square matrix A of size u × u
(line 3), all of which are dedicated to representing the u utility rules
through the clustering process. Every cluster $C_i$ will be the official
nest of exactly one utility rule, $ur_i$, and expert ant $a_i$ will be working
for its benefit while necessary.

At the clusters’ initialization phase, every expert ant in turn,
representing a specific utility rule, picks a record from $\mathcal{T}_{UR}$ that
embeds its corresponding utility rule; this process is repeated until
$\mathcal{T}_{UR}$ is empty (lines 5-13). This doesn’t mean that the distribution
of the records has been fair for the utility rules. In fact, a record may
contain multiple utility rules, and having one ant transport it to
its own cluster means that other ants lose it. This calls for a refining
process in the next steps.

While transporting the records, the pheromone matrix A, is updated 
with the support of the utility rules in the clusters (lines 9-11). 

For a predefined number of iterations, or until every ant becomes
jobless (line 14), each ant proceeds according to the following logic:
First it calculates the $\beta_{ur}$ of its utility rule ur (line 17) and checks
if $\beta_{ur} < \beta_{predefined}$ or if there are fewer than k records embedding ur
in its corresponding cluster while there are still such
records available outside the cluster; in this case, function PICKUP is called
(lines 18-19).

At this point, the expert ant is responsible for bringing a record containing
the utility rule in question, ur, to its representative cluster.
The PICKUP function chooses the cluster that has the highest
density of ur (line 2), picks a corresponding record r (line 7),
withdraws it from its cluster (line 8) and drops it in the cluster
representing ur (line 9). Then, the pheromone matrix, A, is updated
for both the source and the destination clusters, based on which utility
rules are found within the transported record r (lines 10-15).
However, when the record r is picked from a cluster that has a size
less than k, the expert ant finds a record from another cluster that
can fill the gap left by transporting r, by recursively calling the PICKUP
function, ensuring that the cluster preserves its size.

However, if $\beta_{ur} \ge \beta_{predefined}$, the expert ant is free to work for the
benefit of another ant during the current iteration. In this case, the
number of jobless ants increases (line 21) and function DROPLOAD
is called (line 22) to find another ant to help. Function DROPLOAD
finds the utility rule that still needs the most iterations to achieve
$\beta_{predefined}$ (line 2). Then, it calls the PICKUP function to find a
record that can be transported to the corresponding cluster.

At the end of the iterations, there exist u clusters, each representing
mainly one utility rule. The resulting clusters may have sizes
greater than the maximum cluster size $\delta$ allowed. To abide by the $\delta$
constraint of the disassociation technique, every cluster is passed to
the SPLIT function (line 28) and is split, when necessary, into smaller clusters
each having a size less than or equal to $\delta$.

The algorithm ends by vertically partitioning the resulting clusters 
from (UDAC) (line 30) and treats all the other records via the normal 
processes of the disassociation technique (line 31). 


Algorithm 1 UDAC 


Require: T, UR, k, δ, β_predefined, it
Ensure: T*
 1: T_UR = {r | r ∈ T and ∃ur ∈ UR and ur ⊆ r}
 2: u = |UR|
 3: create u clusters, u ants and square matrix A[u][u]
 4: it_count = 0
 5: while (T_UR ≠ ∅) do
 6:     for each expert ant a_i do
 7:         C_i = C_i ∪ r | r ∈ T_UR and ur_i ⊆ r
 8:         T_UR = T_UR \ r
 9:         for (j = 0; j < u; j++) do
10:             A[i][j] = s(ur_i, C_j)
11:         end for
12:     end for
13: end while
14: while (it_count < it or jobless_ants < u) do
15:     jobless_ants = 0
16:     for each expert ant a_i do
17:         β_ur_i = A[i][i] / s(ur_i, T)
18:         if (β_ur_i < β_predefined or A[i][i] < k < s(ur_i, T)) then
19:             PICKUP(a_i, ur_i, 0)
20:         else
21:             jobless_ants++
22:             DROPLOAD(a_i)
23:         end if
24:     end for
25:     it_count++
26: end while
27: for each cluster C_i do
28:     ClustersSet = ClustersSet ∪ SPLIT(C_i)
29: end for
30: VERTICALPARTITIONING(ClustersSet)
31: DISASSOCIATION(T \ T_UR)


5 EXPERIMENTS 
5.1 Experimental settings 


In this section we present a set of experiments, evaluating our utility 
driven ant-based clustering technique, UDAC, in terms of privacy 
and utility. We choose for the experiments the BMS1 dataset, which 
contains click-stream E-commerce data, and 6 sets of utility rules 
extracted from the dataset with different characteristics shown in 
 1: procedure PICKUP(a_i, ur_j, Recurcount)
 2:     C_pu = argmax(d(ur_j, C_n)), ∀n ≠ j
 3:     if |C_pu| < k and Recurcount < u then
 4:         Recurcount++
 5:         PICKUP(a_i, ur_j, Recurcount)
 6:     end if
 7:     load(a_i) = r | r ∈ C_pu and ur_j ⊆ r
 8:     C_pu = C_pu \ r
 9:     C_j = C_j ∪ load(a_i)
10:     for (l = 0; l < u; l++) do
11:         if (ur_l ⊆ load(a_i)) then
12:             A[l][pu]−−
13:             A[l][j]++
14:         end if
15:     end for
16: end procedure


 1: procedure DROPLOAD(a_i)
 2:     ur_q = argmax_{∀j ≠ i}(s(ur_j, T) * (β_predefined − β_ur_j))
 3:     PICKUP(a_i, ur_q, 0)
 4: end procedure


 1: procedure SPLIT(C_i)
 2:     if |C_i| > δ then
 3:         create new cluster C_new
 4:         for (int l = 0; l < δ; l++) do
 5:             load(a_i) = r | r ∈ C_i
 6:             C_i = C_i \ r
 7:             C_new = C_new ∪ load(a_i)
 8:         end for
 9:         AntClusters = AntClusters ∪ C_new
10:         SPLIT(C_i)
11:     else
12:         AntClusters = AntClusters ∪ C_i
13:     end if
14:     return AntClusters
15: end procedure
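The effect of the SPLIT procedure can be sketched with an iterative equivalent in Python. This is a minimal illustration that assumes a cluster is an ordered list of records and that it does not matter which records are moved to each new cluster; the name `split_cluster` is hypothetical, and slicing replaces the recursion used in the pseudocode.

```python
from typing import List, Sequence, TypeVar

R = TypeVar("R")


def split_cluster(cluster: Sequence[R], delta: int) -> List[List[R]]:
    """Cut a cluster into consecutive pieces of at most delta records."""
    if delta < 1:
        raise ValueError("delta must be a positive integer")
    return [list(cluster[start:start + delta])
            for start in range(0, len(cluster), delta)]


# Toy cluster (assumption): seven record identifiers, maximum cluster size 3.
pieces = split_cluster(list(range(7)), 3)
```

As in the pseudocode, every piece except possibly the last holds exactly $\delta$ records. One difference is that an empty input yields no piece at all, whereas the recursive version would return the empty cluster itself.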


Table 3: Utility rules’ characteristics

Utility Rules Set | set1 | set2 | set3 | set4 | set5 | set6
Set Cardinality | 5 | 10 | 15 | 20 | 25 | 30
Records Interdependency | 1.35 | 1.39 | 1.38 | 1.39 | 1.65 | 1.82
Lowest Frequency | 359 | 8 | 8 | 4 | 8 | 2
Highest Frequency | 1204 | 85 | 1204 | 460 | 359 | 225
Records Count $|\mathcal{T}_{UR}|$ | 2379 | 231 | 2235 | 1173 | 898 | 823


Table 3. Records Interdependency represents the average number
of utility rules in the records of $\mathcal{T}_{UR}$. We compare UDAC (with
a = 0.8) to the widely used unsupervised clustering technique,
k-means, and horizontal partitioning (HP) of disassociation.


5.2 Utility privacy trade-off evaluation 


In the disassociation technique, privacy of the data is preserved
based on the $k^m$-anonymity model through the process of vertical
partitioning. Our work does not redefine the vertical
partitioning that is responsible for privacy preservation. However,
for any data analysis on the anonymized data, we should
study the effect of the disassociation on the final output in
terms of data accuracy. In the following experiments we highlight
the effect of different clustering techniques on the final result of
vertical partitioning for the utility rules. On the same set of records
representing the utility rules, we run k-means with
the cosine distance and our UDAC algorithm in turn, then we apply vertical
partitioning from the disassociation technique to each cluster.


5.2.1 Relative loss. The first metric we use to evaluate the loss in
association with regard to the predefined utility rules is the relative
loss metric RLM, defined as follows:

\[ RLM = \frac{\sum_{i=1}^{|UR|} \bigl(s(ur_i, \mathcal{T}) - s(ur_i, \mathcal{T}^*)\bigr)}{\sum_{i=1}^{|UR|} s(ur_i, \mathcal{T})} \]

where $s(ur_i, \mathcal{T})$ and $s(ur_i, \mathcal{T}^*)$ represent the support of the utility
rule $ur_i$ in the original dataset, $\mathcal{T}$, and in the anonymized dataset,
$\mathcal{T}^*$, respectively.
Figure 3 shows a strong effect of the different clustering techniques
on the loss of utility rules, with UDAC achieving the lowest RLM
for all the sets compared to k-means and HP. However, it should
be noted that for the highly frequent set of utility rules, set1, the
clustering techniques yield similar values for RLM. Since HP already
works on the basis of item frequency, it preserves
utility better for frequent itemsets, while remaining weaker for
sets that contain infrequent utility rules such as set2, which reveals
this weakness of HP. These results show the crucial need for a
special clustering technique for utility preservation, consistent with the
discussion in section 2.
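The relative loss metric can be computed with a few lines of Python. This is a minimal sketch assuming the supports of the utility rules are given as two aligned sequences, one for the original dataset and one for the anonymized dataset; the function name `relative_loss` and the sample values are illustrative assumptions, not measurements from the experiments.

```python
from typing import Sequence


def relative_loss(original: Sequence[int], anonymized: Sequence[int]) -> float:
    """Sum of lost support over all rules, divided by the total original support."""
    if len(original) != len(anonymized):
        raise ValueError("both sequences must list the same utility rules")
    total = sum(original)
    if total == 0:
        raise ValueError("the utility rules have no support in the original dataset")
    lost = sum(o - a for o, a in zip(original, anonymized))
    return lost / total


# Hypothetical supports for two utility rules.
rlm = relative_loss([10, 20], [5, 15])
```

A value of 0 means that every utility rule kept its full support after anonymization, and a value of 1 means that all of it was lost.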


[Plot of the relative loss against SetCard; legend: UDAC, Kmeans]


Figure 3: Relative Loss Metric 


5.2.2 Average record partitioning. ARP is used to represent the
average number of splits a record had to endure after vertical
partitioning. We define ARP as:

\[ ARP = \frac{\sum_{i} \bigl(|C_i| \cdot count(R_{iC})\bigr)}{|\mathcal{T}_{UR}|} \]

where $count(R_{iC})$ is the number of record chunks in cluster $C_i$ and
$|\mathcal{T}_{UR}|$ is the number of records treated through special clustering.
A low ARP means that less noise is added by vertical partitioning,
so more terms remain associated in the record chunks. Figure
4 shows that UDAC outperforms k-means clustering.
What is surprising is that, although k-means uses distance metrics
to evaluate the similarity between the records over the whole set
of items while UDAC groups records based on the presence of
utility rules in the records, the final vertical partitioning is better for
UDAC than for k-means clustering. This means that beyond the items
belonging to a utility rule, the other items present in $\mathcal{T}_{UR}$ weren't
excessively disassociated vertically, reflecting the efficiency of
our solution in comparison to the classical clustering technique,
k-means.
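The ARP computation can be sketched in Python as follows. This is a minimal illustration assuming each cluster is summarized by its size and its number of record chunks, and that the number of specially clustered records is passed in explicitly; the function name `average_record_partitioning` and the sample values are hypothetical.

```python
from typing import Sequence, Tuple


def average_record_partitioning(clusters: Sequence[Tuple[int, int]],
                                total_records: int) -> float:
    """clusters holds (cluster size, number of record chunks) pairs."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    weighted = sum(size * chunks for size, chunks in clusters)
    return weighted / total_records


# Hypothetical result: one cluster kept in a single chunk, one split in two.
arp = average_record_partitioning([(4, 1), (6, 2)], 10)
```

When every cluster keeps a single record chunk, the value is 1, which is the best case for preserving associations.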




5.2.3 UDAC under the utility-privacy perspective. This experiment
communicates the whole essence of the trade-off between utility
and privacy in disassociation. We set $\beta_{predefined} = 0.8$ and the maximum
number of iterations to 3000, and run the UDAC algorithm to see the
average $\beta_{ur}$ of the sets. To avoid cluttering the graphs, we only
show in Figure 5 the results for two sets: set1 in red and set6 in blue.
Figure 4: Average Record Partitioning


An average $\beta_{ur}$ greater than $\beta_{predefined}$ is possible when the records’
interdependency is low, as in set1; this means that the representation
of the utility rules in the clusters is more accurate. When this
interdependency between the records is higher, as in set6, it becomes
harder to achieve $\beta_{predefined}$. On the other hand, the graph shows
a lower $\beta_{ur}$ as the values of k and m increase: the higher
the privacy requested, the lower the utility that can be achieved by
safeguarding the utility rules without vertical partitioning.


Figure 5: The k-m-β relationship


6 CONCLUSION 


Choosing the right anonymization technique depends mainly on the
type of data in question and the desired result after anonymization.
In fact, many techniques are proposed in the literature either for
query answering or for publishing the data, while preserving the
privacy of the users. Anonymization becomes harder when data must
be analyzed after publishing it, and the challenge is to find a good
trade-off between privacy and utility. Disassociation is a technique
that provides a form of anonymization without altering the value
of the data items. It guarantees $k^m$-anonymity for associations
within a cluster by vertically partitioning it into record chunks. In
this paper, we analyze the loss of associations for aggregate analysis,
showing its direct link to the clustering process of the records
before vertical partitioning. Driven by this problem, we propose the
utility-driven ant-based clustering algorithm (UDAC) to drive the
process of clustering for a set of records representing the predefined
utility rules, increasing their utility by indirectly favoring their
preservation unpartitioned. Finally, to test our algorithm,
we compare it, for various properties of utility rules,
with the classical clustering technique, k-means, and the normal
horizontal disassociation. The results show that the information loss
for the utility rules clustered via our UDAC algorithm decreases
compared to the other two solutions.

From the theoretical analysis in section 2, a data analyst has to
first calculate the average support of associations to be able to
execute analysis over a disassociated dataset. If the associations in
question aren't predefined and treated through UDAC, the accuracy
of any analysis is at risk. In future work, we aim at publishing
anonymized set-valued data, fully ready for analysis.
ABSTRACT


Recent advances in sensor technology and information processing 
have allowed connected environments to impact various appli- 
cation domains. In order to detect events in these environments, 
existing works rely on the sensed data. However, these works are 
not re-usable since they statically define the targeted events (i.e., 
the definitions are hard to modify when needed). Here, we present 
a generic framework for event detection composed of (i) a repre- 
sentation of the environment; (ii) an event detection mechanism; 
and (iii) an Event Query Language (EQL) for user/framework in- 
teraction. This paper focuses on detailing the EQL which allows 
the definition of the data model components, handles instances of 
each component, protects the security/privacy of data/users, and 
defines/detects events. We also propose a query optimizer in order 
to handle the dynamicity of the environment and spatial/temporal 
constraints. We finally illustrate the EQL and conclude the paper 
with some future works. 


CCS CONCEPTS 


• General and reference → General conference proceedings;
• Information systems → Query representation; • Theory
of computation → Grammars and context-free languages;
• Computer systems organization → Sensor networks.


KEYWORDS 
Event Query Language, Internet of Things, Sensor Networks 


ACM Reference Format: 

Elio Mansour, Richard Chbeir, and Philippe Arnould. 2019. EQL-CE: An 
Event Query Language for Connected Environments. In 23rd International 
Database Engineering & Applications Symposium (IDEAS’19), June 10-12, 
2019, Athens, Greece. ACM, New York, NY, USA, 10 pages.
https://doi.org/10.1145/3331076.3331103


1 INTRODUCTION 


Recent advances in the fields of Information & Communication 
Technologies (ICT), Big Data, Sensing Technologies, and the In- 
ternet of Things (IoT) have paved the way for the rise of smart 
connected environments. These environments are defined as infrastructures
that host a network of sensors capable of providing
data that can be later mined and processed using advanced tech- 
niques, for high level applications. Hence, Sensor Networks (SN) 
are currently impacting numerous domains (e.g., medical, environ- 
mental, cities, buildings). This allowed a plethora of sensor-based 
applications such as monitoring a patient's health [20], detecting 
fires in the wilderness [24], monitoring pollution levels or traffic 
congestion in a city [15], and optimizing energy consumption/occupants'
comfort in buildings [1, 8, 14, 19, 23, 25]. Even though these
applications have different objectives, they all rely on sensed data 
from the environment in order to detect specific events (e.g., a 
stroke for a patient, a volcanic eruption, a storm, polluted air in a 
city, temperature rising in an office). Therefore, these applications 
share the following needs: (i) representing the infrastructure and 
the sensor network of the connected environment; (ii) defining and 
detecting the targeted events; and (iii) protecting the security of the 
sensed data and the privacy of the users in the environment (e.g., 
protecting patients' medical records). In the aforementioned works, 
the authors do not emphasize the environment's representation
and define the events statically. They also propose event detection
mechanisms that perfectly fit the description of the targeted events. 
This is constraining since these works are not re-usable in different 
contexts. Event Query Languages (EQL) have been proposed to 
overcome this issue. Users express their needs through EQLs by 
defining the structure of the targeted events. However, existing lan- 
guages [2, 3, 6, 7, 9-11] focus mainly on the event descriptions and 
do not consider other environment components (e.g., infrastruc- 
ture, sensor network, application domain). They share the following 
limitations: 


(1) Lack of considered components. It is important that the EQL 
allows the definition of the entire connected environment. 
This includes components related to the environment itself, 
its sensor network, the targeted events, and the application 
domain. 

(2) Lack of considered functionality. It is important that the EQL 
(i) allows the definition of components (e.g., buildings, sen- 
sors, data, events); (ii) allows the manipulation of component 
instances (e.g., inserting new instances, updating, deleting, 
selecting them); and (iii) protects the security/privacy of 
data/users. 

(3) Lack of re-usability. It is important that the EQL remains 
generic and independent from any technological constraints 
or underlying infrastructure. Some languages heavily rely on 
a specific syntax or data model (e.g., SQL-based, or SPARQL-based)
and this limits their re-usability in different contexts.
In order to consider the dynamicity of connected environments and
spatial/temporal distributions, we consider the following limitations
as well. First, the difficulty in handling dynamic environments.
Since sensors might break down, or mobile sensors could change
locations or even enter/exit the network, sensors/observations that 
are needed for a previously defined event might become unavailable. 
Therefore, it is important that the EQL allows query re-writing in 
order to update obsolete event definitions. This entails replacing 
missing sensors by others capable of providing the required data 
or replacing missing observations with others that fit the event 
definition. Second, the lack of spatial distribution of sensors. Since 
the sensors’ locations impact event detection, the EQL should allow 
users to define spatial distributions of the sensors over the infras- 
tructure in order to better detect the targeted events. This entails 
specifying where each sensor should be located or how they should 
be distributed over the space (e.g., nearest sensors to a point of in- 
terest, sensors within a range of a point of interest, sensors that fit 
a mathematical distribution around a point of interest). Finally, the 
lack of temporal distribution of sensor observations. Since sensors 
provide observations at specific rates, one could end up with either: 
(i) big volumes of unnecessary data (if the rate is too quick); or (ii) 
undetected events (if the rate is too slow). Therefore, it is important 
to have an EQL that allows the adjustment of the temporal distri- 
bution of sensor observations based on events' needs/requirements. 
This entails specifying which sensor observations/sensing rates are 
considered for a specific event, or selecting a temporal distribution 
of these observations (e.g., the closest observations to a certain 
point in time, all observations within a temporal range, distributed 
sensing rates). 

Many other challenges emerge when considering an EQL for con- 
nected environments (e.g., handling big volumes of data, continu- 
ous heterogeneous data streams). However, in this paper, we focus 
mainly on the aforementioned limitations. Hence, we propose here 
an EQL specifically designed for connected environments and parti- 
tioned into three layers: conceptual, logical, and physical. It allows 
(i) the composition of high level generic queries that can be parsed 
into various data model specific languages (re-usability); and (ii) full 
coverage of components and functionality (we will detail security 
related tasks in a dedicated future work). We also propose a query 
optimizer module that will handle spatial/temporal distributions 
and query re-writing in order to redefine components that need to 
evolve when handling the environment's dynamicity (the optimizer 
will be fully detailed in a separate work). Our proposal, denoted 
EQL-CE, is part of a global framework for event detection in con- 
nected environments which we will also present in this work. 
The remainder of the paper is organized as follows. Section 2 
presents a scenario that motivates our proposal. Section 3 evaluates 
existing approaches. Section 4 presents our event detection frame- 
work and details EQL-CE. An illustration example is presented 
in Section 5. Finally, Section 6 concludes the paper and discusses 
future research directions. 


2 MOTIVATING SCENARIO 


In order to motivate our proposal, consider the following scenario 
that illustrates a smart mall. This is a simplified example that illus- 
trates the setup, the needs, and motivations behind our proposal. 
Figure 1: The Smart Mall 


Of course, it does not summarize all needs found in a connected en- 
vironment/event detection application scenario. Figure 1 details the 
infrastructure's location map, and individual locations (i.e., shops 
and open areas). The mall is equipped with a hybrid sensor network 
having static/mobile sensors, single sensor nodes/multi-sensor devices 
capable of monitoring the environment and producing scalar/multimedia 
observations (e.g., temperature, video). A manager uses 
an Event Query Language (EQL) in order to define/detect inter- 
esting events that occur within the mall's premises. Although this 
seems enough to manage the smart mall, many improvements can 
still be integrated: 


* Need 1: Modeling the environment and its sensor network. 
Before defining and detecting events, a mall manager needs 
to represent the smart mall using the EQL. This includes 
defining the infrastructure (i.e., the mall), the locations (e.g., 
shops), and their spatial relations. Then, the manager needs 
to define the sensor network that is hosted in the mall. This 
entails modeling the available sensors (e.g., temperature, 
humidity), their deployment locations, the data they sense 
and so on. Once all component structures are defined, the 
mall manager needs to use the EQL to create instances of 
each component (e.g., temperature sensor in food court). This 
is currently not possible since the EQL used in the example 
only defines events. 

* Need 2: Measuring the average temperature in the grocery 
store (for food storage concerns). The mall manager uses 
the existing EQL to define the targeted event (i.e., the aver- 
age temperature in the grocery store). The EQL allows the 
manager to consider all sensors within the area of interest. 
However, Figure 2.a shows that the sensors are not evenly 
distributed in the store (most are located in the upper left 
corner). Hence, considering all sensors and calculating the 
average will produce a biased temperature measure that does 
not reflect the reality of the situation. This can be solved 
by allowing the manager to define a specific distribution of 
sensors over the space (e.g., even distribution, only consider- 
ing sensors within a range of the center of the store). The 
current setup is limited since it does not allow the definition 
of spatial distributions of sensors. 

* Need 3: Minimize data overload/missed events. Currently, 
the manager can use the EQL to define one sensing rate for 
all sensors or sensor types (e.g., temperature, humidity). This 
is constraining since (i) a quick sensing rate overloads the 
system with big volumes of unwanted/unnecessary data; 
and (ii) a slow sensing rate could lead to missing events 
that began and ended in a short time lapse. Therefore, the 
temporal distribution of sensor observations (i.e., a start 
time, a specific rate, a stop time) should be based on the 
event definition and therefore considered/handled in the 
event queries (e.g., selecting the closest observations to a 
time of interest, considering different sensing rates from 
various sensors at once). The EQL used by the mall manager 
does not allow such customization of temporal constraints 
(cf. Figure 2.b). 

* Need 4: Detecting a fire event in Shop 1. The mall manager 
defines this fire event using the EQL. His/her definition 
relies on the smoke, humidity, and CO2 sensors located in 
Shop 1. However, what if the smoke sensor broke down? 
Or what if the mobile device that he/she was depending on 
left the shop? Then, the previously defined event query will 
become obsolete since there are no more smoke observations 
coming from Shop 1, and there is no way of changing the 
event definition. Hence, query re-writing is necessary in 
order to update the definition: (i) by replacing the smoke 
sensor by another capable of providing the same data (e.g., 
mobile device 1 - cf. Figure 2.c left); or (ii) by replacing the 
event describing feature smoke by another (e.g., temperature 
from mobile device 1 if no other sensors can provide smoke 
observations - cf. Figure 2.c right). The current EQL is limited 
since it does not allow such re-writing. 
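
The two re-writing options of Need 4 can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the `Sensor` class, the `rewrite_event` function, and the `fallback` mapping are hypothetical names that do not belong to any existing EQL, and the sensor identifiers are made up for the smart mall example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sensor:
    sensor_id: str
    location: str
    properties: frozenset  # observable properties, e.g. {"smoke"}
    active: bool = True    # False models a breakdown or a departed device


def rewrite_event(features, location, sensors, fallback):
    """Bind each event feature to an active sensor found in `location`.

    Option (i): another sensor that observes the same feature is used.
    Option (ii): if none exists, the feature is replaced by the substitute
    listed in `fallback` (a hypothetical, user-supplied mapping).
    Returns {feature_used: sensor_id}; raises LookupError otherwise.
    """
    def find(feature):
        for s in sensors:
            if s.active and s.location == location and feature in s.properties:
                return s.sensor_id
        return None

    binding = {}
    for feature in features:
        used, sensor_id = feature, find(feature)
        if sensor_id is None and feature in fallback:
            used = fallback[feature]
            sensor_id = find(used)
        if sensor_id is None:
            raise LookupError(f"no sensor can cover feature {feature!r}")
        binding[used] = sensor_id
    return binding


# Made-up instances: the static smoke sensor of Shop 1 is broken.
sensors = [
    Sensor("smoke-1", "Shop 1", frozenset({"smoke"}), active=False),
    Sensor("hum-1", "Shop 1", frozenset({"humidity"})),
    Sensor("co2-1", "Shop 1", frozenset({"co2"})),
    Sensor("mobile-1", "Shop 1", frozenset({"temperature"})),
]
binding = rewrite_event(["smoke", "humidity", "co2"], "Shop 1", sensors,
                        fallback={"smoke": "temperature"})
```

Here the smoke feature cannot be covered by any active sensor, so the sketch falls back to the temperature observations of the mobile device, which corresponds to option (ii).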


In order to address the aforementioned needs, the EQL should 
provide a means for defining the structure of various components 
related to the environment, sensor network, targeted events, and 
application domain. Moreover, the EQL should not be limited to 
defining components. Its functionality should extend to managing 
instances of these components, and protecting the security/privacy 
of the data/users (cf. Need 1). In addition, customizing the sensors' 
spatial distribution over the infrastructure/environment based on 
event requirements is required (cf. Need 2). This benefits the event 
detection since it provides the user with the ability to customize 
the setup in the way that he/she believes is optimal. The same 
applies to the temporal distribution of sensor observations. 
The EQL should allow the user to select specific observations, or a 
set of distributed observations in time (e.g., considering different 
sensing rates, temporal distance to a point in time) when defining 
the event (cf. Need 3). Finally, the EQL should allow re-writing 
queries (cf. Need 4) to handle the dynamicity of the connected 
environment. This is especially beneficial when faults or sensor 
breakdowns/mobility can render some event definitions obsolete. 
However, when considering various components, functionality, 
data distribution (e.g., spatial, temporal), and query re-writing, the 
following challenges emerge: 


* Challenge 1: How to model components and inter-component 
relations? How to establish ties between the different connected 
environment elements (i.e., environment, sensor network, 
events, and application domain)? 

* Challenge 2: How to define different query types to cover all 
the required functionality? 


Figure 2: Spatial Distribution (a) | Temporal Distribution (b) | Query Re-writing (c) 


* Challenge 3: How to establish a generic query syntax that 
can be re-used regardless of the underlying infrastructure 
(e.g., in a traditional database or in an ontology data model)? 

* Challenge 4: How to integrate variables that specify spatial/temporal 
distributions in the query syntax? How to 
propose different types of distribution queries? 

* Challenge 5: How to enable query re-writing upon user request? 
How to replace missing sensors/event describing 
features when re-writing a query? 


Therefore, we propose here a high-level generic event query language, 
denoted EQL-CE, capable of covering all components. The 
covered functionality is partitioned into three main categories for 
component definition, manipulation of component instances, and 
data protection (to be discussed in a future work). We also propose a 
query optimizer that handles query re-writing and spatial/temporal 
distribution functions. In this paper, we present the optimizer but 
leave the details of the query re-writing and distribution functions 
to a separate dedicated work. 


3 RELATED WORK 


In this section, we review existing works on Event Query Languages 
(EQL). We propose the following criteria based on the challenges 
and limitations discussed in Section 2: 
* Criterion 1. Component/Functionality Coverage: Denoting 
if the EQL covers (i) all the components that constitute a 
connected environment (i.e., environment, sensor network, 
application domain, and event related components); and 
(ii) the entire set of functionality needed for the definition 
of components, the manipulation of their instances, and 
protection of the data/user security and privacy (cf. Need 1). 

* Criterion 2. Re-usability: Indicating if the EQL is generic 
and technology independent in order to re-use it in various 
setups with different underlying infrastructures (e.g., tra- 
ditional database, ontology). It is beneficial to have a high 
level, generic, and declarative EQL that can be parsed into 
data-model specific languages (instances). This facilitates its 
integration in various contexts. 

* Criterion 3. Spatial/Temporal Distributions: Specifying if the 
EQL allows (i) spatial distribution queries (e.g., selecting sensors 
that are distributed based on a mathematical law, within 
a specific range, or near a point of interest); and (ii) temporal 
distribution queries (e.g., selecting sensor observations that 
are closest to a point in time or that have various sensing rates). 
This is important for the definition of specific events where 
such a level of detail is required (cf. Needs 2 and 3). 

* Criterion 4. Handling Environment Dynamicity: Stating if the 
EQL provides the means to modify the structure of previously 
defined components (e.g., events) in order to keep up with 
environment changes. This is useful in a dynamic setup, 
where sensor mobility causes gain/loss of data in certain 
areas of the environment (cf. Need 4). 


We group the existing works into three main categories: (i) conceptual 
languages (e.g., Event-Condition-Action languages); (ii) 
logical languages; and (iii) physical languages (e.g., SQL/SPARQL-based 
languages). We compare in the following some works from each 
category (we do not detail here every existing event query language 
for the sake of brevity). 


3.1 Conceptual Languages 


This category of languages includes Event Condition Action (ECA) 
languages that allow the declaration of three event attributes: (i) 
an event name or label; (ii) a set of conditions (the pattern) that 
best define the event; and (iii) the set of actions that should be 
triggered once the event is detected. In [9], the authors propose an 
intuitive event query language denoted SNOOP. They follow the 
ECA model when defining event structures. They integrate operators 
for inter-condition relations (e.g., conjunction, disjunction, 
and sequence) and represent repetitive events through the usage 
of the periodic/aperiodic operators. In [6], the authors propose a 
language denoted CeDR. In comparison with SNOOP, CeDR adds 
a WHERE clause for filtering statements and has a wider range 
of operators. Therefore, CeDR is considered more expressive in 
terms of event pattern description. CeDR also includes an event 
lifetime operator and a detection window operator. The authors 
in [11] propose an event query language for data streams called 
SaSE. They include the WITHIN and RETURN statements to respec- 
tively declare sliding time windows and the required output. SaSE 
also allows event pattern operators (similar to CeDR) in a WHERE 
clause. 
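
The three ECA attributes can be illustrated with a short Python sketch. It does not reproduce the syntax of SNOOP, CeDR, or SaSE; the `EcaRule` class and the overheat rule are hypothetical and only show how a name, a set of conditions, and a set of actions fit together.

```python
from dataclasses import dataclass


@dataclass
class EcaRule:
    name: str         # (i) event name or label
    conditions: list  # (ii) the pattern: predicates over one observation
    actions: list     # (iii) callbacks triggered once the event is detected

    def evaluate(self, observation: dict) -> bool:
        """Trigger all actions if every condition holds (conjunction)."""
        if all(condition(observation) for condition in self.conditions):
            for action in self.actions:
                action(self.name, observation)
            return True
        return False


# Hypothetical rule: a room overheat event with a single condition.
alerts = []
overheat = EcaRule(
    name="room overheat",
    conditions=[lambda obs: obs["temperature"] > 30],
    actions=[lambda name, obs: alerts.append((name, obs["location"]))],
)
detected = overheat.evaluate({"temperature": 35, "location": "Shop 1"})
ignored = overheat.evaluate({"temperature": 21, "location": "Shop 1"})
```

Only conjunction is modeled here; the operators mentioned above (disjunction, sequence, periodic/aperiodic) would require a richer pattern representation.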


Discussion: The aforementioned works are intuitive, practical, 
and allow various composition operators for event definition. Their 
syntax is also independent from specific data models (e.g., SQL or 
SPARQL). However, they all suffer from the same limitations. None 
of them covers the environment or sensor network definition in 
their queries (cf. Criterion 1 - Component Coverage). They mainly 
focus on the definition and retrieval of events while neglecting other 
tasks such as updating definitions or inserting data (cf. Criterion 1 - 
Functionality Coverage). They also do not consider spatial/temporal 
distributions (cf. Criterion 3). 


3.2 Logical Languages 


This category includes works that define events in logic style for- 
mulas. To give a few examples, consider ETALIS[3]. This EQL de- 
scribes events as rules. The authors propose a set of temporal rela- 
tions and composition operators to define the event patterns. The 
syntax of the rules is independent of any underlying data model. 
XChangeEQ[7] is another logical language. The authors allow the 
following features in its queries: (i) data-related operations such as 
variable bindings and conditions containing arithmetic operations; 
(ii) event composition operators such as conjunction, disjunction, 
and order; (iii) temporal and causal relations between events in the 
queries; and (iv) event accumulation, for instance aggregating data 
from previous events to discover new ones. 


Discussion: The aforementioned languages are re-usable in dif- 
ferent contexts since their syntax, a logical rule-based notation, 
is independent of specific data models (cf. Criterion 2). They also 
cover the majority of temporal and composition operators. How- 
ever, they do not cover spatial/temporal distributions (cf. Criterion 
3). These languages have not fully detailed query re-writing (cf. 
Criterion 4), and they mainly focus on the events. They cannot be 
used to define and manage the environment and sensor network 
components (cf. Criterion 1). 


3.3 Physical Languages 


This category includes data model specific languages. We detail 
here languages that were specifically designed for either relational 
database or linked data management systems. Therefore, the following 
EQLs are either inspired by or directly extend SQL/SPARQL. 
ESPER[10] is an implementation for event detection in database systems. 
The authors proposed an SQL-like syntax for event processing. 
Therefore, known operators such as CREATE, SELECT, INSERT, UPDATE, 
and DELETE are available for event definition and detection. 
ESPER also includes temporal operators and a specific statement 
for event definition (i.e., the pattern). In addition to the aforementioned 
advantages, this language has a fast learning curve since it 
is highly similar to traditional SQL. CQL[4] is another language 
that can be used for event definition/retrieval. CQL extends SQL 
by emphasizing continuous data streams/queries. The authors 
add temporal operators, sliding windows, and window parameters 
to better handle continuous data. Many languages extend SPARQL 
for linked data management systems. For instance, C-SPARQL[5] 
extends SPARQL to consider stream data in the queries. To do so, 
the authors integrate sliding time windows. SPARQL-ST[17] extends 
SPARQL by adding operators for spatial/temporal queries. 
This covers the definition and manipulation of spatial shapes and 
temporal entities. Finally, EP-SPARQL[2] integrates event processing 
operators (e.g., sequence) into the SPARQL syntax. This work 
allows the definition of simple and complex event patterns in a 
linked data management system. 


Discussion: The aforementioned works are all user friendly since 
they extend known languages. They cover definition and manipu- 
lation queries for various components or entities (cf. Criterion 1). 
They also provide a basis for spatial/temporal operators and query 
re-writing. However, distribution queries are not considered (cf. 
Criterion 3) and their high reliance on a specific data model syntax 
(SQL or SPARQL) limits their re-usability in different systems (cf. 
Criterion 2). For instance, EP-SPARQL cannot be used in a relational 
database infrastructure. 


To conclude this section, none of the mentioned works fully consid- 
ers our entire list of criteria. Therefore, we propose in the following 
section the Event Query Language for connected environments 
(EQL-CE). Our proposal has three layers (conceptual, logical, and 
physical). It ensures re-usability, handles dynamic environments, 
fully covers the components/required functionality, and integrates 
spatial/temporal distribution variables in its queries. 


4 EQL-CE: AN EQL FOR CONNECTED 
ENVIRONMENTS 


In order to highlight the usage of EQL-CE, we present here an 
overview of our framework for event detection in connected en- 
vironments. This framework includes the following modules: (i) a 
data model representing the connected environment; (ii) an event 
Virtual Machine (eVM) for event detection; and (iii) an event query 
language for user/framework interaction. We start by briefly de- 
scribing these modules. Then, we detail our proposed event query 
language for connected environments, denoted EQL-CE. 


4.1 Event Detection Framework 
Figure 3: Global Framework Overview 


Figure 3 illustrates our event detection framework. It contains 
three main modules: 
* An event query language for connected environments (EQL-CE) 
and its query optimizer. 
* A data model for connected environment representation. 
* An event Virtual Machine for event detection (eVM). 


4.1.1 Event Query Language. Users interrogate the system using 
the event query language. It is pivotal since it affects both the data 
model and the event Virtual Machine (event detector) modules. EQL- 
CE offers queries that can be used to define the structures of the data 
model components (i.e., entities that represent the connected envi- 
ronment). In addition, the language allows users to import external 
data models in the framework. Once the data model is defined, it 
is saved in the data storage. EQL-CE also manages instances of 
the previously defined components. It supports operations such as 
inserting new instances or even modifying, deleting, and retrieving 
existing ones. The security and privacy of data/users can also 
be provided by EQL-CE via specific queries. From an event detec- 
tion standpoint, users can trigger the event Virtual Machine via 
the query language in order to detect specific events. Finally, the 
query optimizer allows re-writing queries when needed, and can 
integrate spatial/temporal distribution functions in the queries (cf. 
Criteria 3 and 4). Both EQL-CE and its query optimizer will be 
further detailed in the following subsection. 


4.1.2 Data Model. The data model of the connected environment 
gathers components that describe the environment itself, the sensor 
network, the events, and the application domain. When considering 
the environment, one might represent physical, real world, infras- 
tructures such as buildings or offices and all their characteristics. 
This includes spatial descriptors (e.g., location maps, zones, indi- 
vidual locations, spatial relations), and specific entities that can 
be found in the environment (e.g., machines, equipment, devices). 
When considering the sensor network, one might represent sensors, 
observable properties, scalar/multimedia data, and so on. The tar- 
geted events should also be defined and described in the model. This 
includes event features, types, and patterns. Finally, the application 
domain is also a part of the model since it affects both the events 
and the environment. For instance medical events (e.g., high heart 
rate) differ from environmental events (e.g., temperature overheat 
in a room). Similarly, the equipment and entities found in a mall 
are different from the ones found in a hospital. 


4.1.3 Event Detector. We proposed the event Virtual Machine (eVM) 
in a previous work [16]. It is an event detector that needs an event 
definition and a set of data objects (e.g., sensor observations) in or- 
der to detect targeted events. eVM is re-usable in different contexts, 
extensible, accepts various datatypes, easy to integrate, and requires 
low human intervention. The event detection process starts by retrieving 
the targeted event definition from the storage unit. The 
event definition is analyzed first in order to check its describing 
features. For instance, a fire event is described by the following 
features: time, location, temperature, smoke, and CO2. Then, the 
pre-processor retrieves data objects (e.g., sensor observations) having 
attributes related to these features (e.g., smoke, temperature, 
CO2 observations). Once this is done, we use Formal Concept Analysis 
(FCA), a conceptual clustering technique, in order to construct 
a graph from the selected data objects/attributes. Finally, we detect 
the targeted events by examining the graph nodes and selecting 
the ones that are compatible with the event definition. Also, eVM 
is pluggable in the framework and can be replaced by any other 
event detector that requires data and an event definition in order 
to detect events. We do not detail the event detection mechanism 
in this paper since the aim is to focus on the event query language 
for connected environments. 
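
For readers unfamiliar with FCA, the following Python sketch enumerates the formal concepts of a tiny context by brute force and keeps those compatible with an event definition. It is a generic textbook illustration, not the eVM implementation described in [16]: the function names, the attribute labels (e.g., `smoke_high`), and the compatibility test are our own assumptions.

```python
from itertools import combinations


def formal_concepts(context):
    """Enumerate all formal concepts of a small context by brute force.

    `context` maps each data object (e.g., an observation id) to its set
    of attributes. A concept is a pair (extent, intent) where the extent
    holds exactly the objects sharing every attribute of the intent.
    """
    attributes = sorted(set().union(*context.values()))
    concepts = set()
    for size in range(len(attributes) + 1):
        for subset in combinations(attributes, size):
            # Extent: objects that have every attribute of the subset.
            extent = frozenset(
                obj for obj, attrs in context.items() if set(subset) <= set(attrs)
            )
            # Intent: attributes shared by every object of the extent.
            if extent:
                intent = frozenset(
                    set.intersection(*(set(context[obj]) for obj in extent))
                )
            else:
                intent = frozenset(attributes)
            concepts.add((extent, intent))
    return concepts


def candidate_nodes(concepts, features):
    """Keep non-empty concepts whose intent covers all event features."""
    return [(e, i) for e, i in concepts if e and set(features) <= i]


# Made-up context: three observations described by discretized attributes.
context = {
    "o1": {"shop1", "smoke_high", "temp_high", "co2_high"},
    "o2": {"shop1", "temp_high"},
    "o3": {"foodcourt", "temp_high"},
}
fire_features = {"shop1", "smoke_high", "temp_high", "co2_high"}
concepts = formal_concepts(context)
fire_nodes = candidate_nodes(concepts, fire_features)
```

The brute-force enumeration is exponential in the number of attributes and is only meant for small examples; dedicated FCA algorithms should be preferred in practice.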


4.2 The EQL-CE Proposal 


We structure our proposal into three layers: (i) the conceptual layer 
provides an overview of the connected environment's components 
and their relations in the form of a graph; (ii) the logical layer allows 
the construction of generic queries written in EBNF (Extended 
Backus-Naur Form) syntax; and (iii) the physical layer parses 
the EBNF queries into a data model-specific language (e.g., SQL, 
SPARQL) and executes the parsed queries. A simplified overview 
of EQL-CE is presented in Figure 4. In the following we detail each 
layer separately. 


Figure 4: EQL-CE Overview (Conceptual Layer; Logical Layer: Query Composition; Physical Layer: Query Parsing & Execution) 


4.2.1 Conceptual Layer. Here, we detail the top layer of EQL- 
CE. The aim is to provide a clear and easy to exploit conceptual 
view of the connected environment. Therefore, we use a graph to 
represent the various elements (i.e., components and properties). 
The latter are split into the following categories (cf. Figure 5): 
Figure 5: EQL-CE Conceptual Layer 


Core Modeling: This part contains the basic elements that always 
exist in a connected environment. For a clear organization, we 
group the elements into the following two parts: 


* Sensor Network modeling, where we represent (i) sensor 
networks; (ii) various sensor types (e.g., static, mobile); (iii) 
the different types of properties (i.e., scalar, multimedia) observed 
by sensors; and (iv) the observation values produced 
by sensors (i.e., textual values, multimedia objects and their 
respective metadata). 

* Environment modeling, where we represent (i) platforms (i.e., 
infrastructure, devices) that host sensors or sensor networks; 
(ii) physical infrastructures, such as buildings, and their detailed 
description (i.e., location maps, spatial relations); and (iii) 
devices, such as mobile phones, and their detailed description 
(i.e., hardware, software, provided services). 


Many other components can still be added to the core part. The 
full description of the environment and sensor network can be 
inspired from ontologies such as SOSA/SSN [12] and HSSN (Hybrid 
Semantic Sensor Network).¹ 


Event Modeling: This part contains the representation of events 
that one might wish to detect in a connected environment. Here, 
the application domain should also be considered since it affects 
the definition of specific events. For instance a body overheating 
(medical) event cannot be defined the same way as a room overheat- 
ing (environmental) event. Hence, the application domain dictates 
the type of an event, its describing features, its pattern, and the 
required data for its detection. Therefore, we do not detail the event 
modeling; we keep it generic and restrict it to the following components: 
(i) event that defines an event and its type; (ii) dimensions 
to mathematically represent the event features (provided by the 
Application Domain) in an n-dimensional space; and (iii) event data 
to represent sensor observations that contributed to each event. 
This allows us to have a generic event definition that applies to var- 
ious events from different application domains. All context specific 
details are defined in the application domain and then imported in 
the event definition via the mediator. 


Application Domain Modeling: This part represents the applica- 
tion domain (e.g., medical, energy, military). Since these elements 
differ from one field to another, this part is pluggable into the con- 
ceptual model. It contains basic components/properties denoted 
concepts and relations respectively. Instances of the concept com- 
ponent can be used to define any domain specific entities, and 
instances of the relation property can be used to interconnect the 
concepts (e.g., Figure 5 shows an Event Feature concept that helps 
define event dimensions). This allows the customization of environ- 
ment descriptions and event definitions based on specific contexts. 
For instance, one might wish to represent medical equipment and 
health related constraints when modeling a hospital environment. 
These elements are not the same when describing a shopping center. 
Similarly, what describes medical related events is different from 
normal every day events that happen in a mall. To conclude, this 
part of the data model complements the event description on one 
side, and enriches the environment representation on the other. 


The Mediator: This part of the conceptual model only contains 
properties that ensure the interconnection of the previously men- 
tioned parts (i.e., the core, event, and application domain). For 
instance, a platform hosts a sensor network, the observation values 
produced by the sensors provide event data, the event dimensions 
are defined by event features, and the concept field enriches the 
description of an infrastructure based on the application domain. In 
addition, the mediator can also be used to plug in an external data 
model and align it with the existing elements. 


4.2.2 Logical Layer. The middle layer of EQL-CE, denoted the 
logical layer, allows users to compose/design their queries. The 


¹ http://spider.sigappfr.org/research-projects/hybrid-ssn-ontology/ 
process starts by choosing a specific query type. To cover a wider 
set of functionality (cf. Criterion 1), we provide three main groups 
of queries: 

* The Component Definition Language defines the structure of 
components. Various query types are included in this group 
(e.g., CREATE, ALTER, RENAME, DROP). 

* The Component Manipulation Language handles component 
instances. Here we propose the following query types: 
SELECT, INSERT, UPDATE, and DELETE. 

* Component Access Control (e.g., GRANT, REVOKE). These 
queries manage access rights to component data. We detail 
access control tasks in a dedicated future work. 


Figure 6: EQL-CE Logical Layer 


The process of composing a query is described in Figure 6. First, 
the user chooses the query type (e.g., CREATE, INSERT, DELETE). 
Then, the user starts filling the mandatory statements (e.g., what 
to CREATE, what to SELECT, from which component). Once this 
is done, the user can add optional statements for filtering, ordering, 
or calling external functions. Finally, the query is written using 
an Extended Backus-Naur Form syntax, denoted EBNF [22]. This 
context-free grammar is used to formally describe programming 
languages. It extends the Backus-Naur Form (BNF). We use EBNF 
since it allows the conception of technology independent queries 
(i.e., queries that do not depend on any data model specific syntax). 
This highlights the ability to re-use (cf. Criterion 2) EQL-CE 
in different setups, since EBNF can later be parsed, in the physical 
layer, to a specific data model instance, such as SQL or SPARQL, 
depending on the underlying infrastructure [13, 18, 21]. Any com- 
ponent from the conceptual model (i.e., related to the environment, 
sensor network, event, and application domain modeling) can be de- 
fined, manipulated, and protected using these queries (cf. Criterion 
1). Finally, the EBNF query is sent to the physical layer. 
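
To illustrate how one generic query could be rendered into two data model specific languages, consider the following Python sketch. It is not the EQL-CE parser: the dictionary representation of the query, the functions `to_sql` and `to_sparql`, the table/column naming scheme, and the `http://example.org/eql-ce#` namespace are all assumptions made for the example.

```python
def to_sql(query):
    """Render a generic INSERT as a parameterized SQL statement.

    Identifiers are assumed to be trusted; values are passed separately
    as parameters (here using the '?' placeholder style).
    """
    columns = list(query["values"])
    table = query["component"].lower().replace(" ", "_")
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders});"
    return sql, [query["values"][column] for column in columns]


def to_sparql(query, namespace="http://example.org/eql-ce#"):
    """Render the same INSERT as a SPARQL 1.1 INSERT DATA update.

    Literal escaping is deliberately omitted to keep the sketch short.
    """
    cls = query["component"].title().replace(" ", "")
    subject = "ex:" + query["values"]["id"].replace(" ", "_")
    lines = [f"{subject} a ex:{cls}"]
    lines += [f'ex:{key} "{value}"' for key, value in query["values"].items()]
    body = " ;\n    ".join(lines)
    return f"PREFIX ex: <{namespace}>\nINSERT DATA {{\n  {body} .\n}}"


# Hypothetical in-memory form of a generic INSERT query on an infrastructure.
generic_query = {
    "type": "INSERT",
    "component": "INFRASTRUCTURE",
    "values": {"id": "Mall Infra", "location_map_id": "Mall Map"},
}
sql_text, sql_params = to_sql(generic_query)
sparql_text = to_sparql(generic_query)
```

The point of the sketch is that the generic query is written once, and only the rendering step depends on the underlying infrastructure.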


4.2.3 Physical Layer & Query Optimizer. The bottom layer of 
EQL-CE (cf. Figure 7) saves the received EBNF queries in a dedi- 
cated storage unit for future use. Then, it parses the aforementioned 
queries into a specific syntax depending on the underlying data 
model (e.g., SQL, SPARQL). Finally, the parsed query is saved and 
sent to the query run engine where it is executed. If needed, exter- 
nal functions, methods, or even algorithms are called (e.g., string 
comparison functions, mathematical libraries). All the above de- 
scribes how EQL-CE can be re-used in various contexts, since it is 
independent of any technological infrastructure (cf. Challenge 3). 
Using the EBNF queries, one can define the data model and all its 
various related components (cf. Challenge 1). In addition, EQL-CE 
Figure 7: EQL-CE Physical Layer 


allows users to handle instances of each component for data re- 
trieval, modification, deletion, security/privacy, and event detection 
by providing a plethora of functionality (cf. Challenge 2). However, 
when defining specific events, one might need to manage the spatial 
distribution of sensors over a location (cf. Need 2). For instance, con- 
sider k-nearest sensors to a specific location, or all sensors within 
a range R of a point in space. Also, one might consider mathemat- 
ical distributions of sensors over a zone (e.g., even distribution). 
Similarly, one might need to manage the temporal distribution of 
sensor observations for specific events (cf. Need 3). For example, 
selecting the k-most recent sensor observations, or all observations 
that were produced during a specific time lapse. Also, one might 
need to select observations based on specific sensing rates. To do 
so, the query optimizer allows the integration of spatial/temporal 
distribution functions in the queries (cf. Criterion 3 and Challenge 
4). Finally, in dynamic connected environments sensors might suffer 
from breakdowns, mobile sensors could enter/leave the network, or 
even change locations. This is challenging since event definitions 
rely on sensors and their provided observations. Hence, some pre- 
viously defined events might become obsolete over time. Therefore, 
in some cases, queries need to be re-written or updated in order to 
handle the dynamicity of the environment, and keep up with its 
evolution (cf. Criterion 4 and Challenge 5). This is also possible via 
the query optimizer. In this paper, we do not fully detail the query 
re-writing and spatial/temporal distribution functions. We leave 
this to a dedicated future work. 
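
Although the distribution functions are left to future work, the selections mentioned above (k-nearest sensors, sensors within a range R, k-most recent observations, observations within a time lapse) can be sketched in Python as follows. The function names, the dictionary layout of sensors and observations, and the coordinates are hypothetical and serve only as an illustration of Needs 2 and 3.

```python
import math
from statistics import mean


def k_nearest(sensors, point, k):
    """Return the k sensors closest to `point` (Euclidean distance)."""
    return sorted(sensors, key=lambda s: math.dist(s["pos"], point))[:k]


def within_range(sensors, point, radius):
    """Return all sensors located at most `radius` away from `point`."""
    return [s for s in sensors if math.dist(s["pos"], point) <= radius]


def k_most_recent(observations, k):
    """Return the k observations with the latest timestamps."""
    return sorted(observations, key=lambda o: o["time"], reverse=True)[:k]


def in_time_lapse(observations, start, end):
    """Return the observations produced between `start` and `end`."""
    return [o for o in observations if start <= o["time"] <= end]


# Made-up grocery store: three sensors clustered in the upper left corner.
sensors = [
    {"id": "t1", "pos": (1, 9), "temp": 8.0},
    {"id": "t2", "pos": (2, 9), "temp": 8.0},
    {"id": "t3", "pos": (1, 8), "temp": 8.0},
    {"id": "t4", "pos": (5, 5), "temp": 4.0},
    {"id": "t5", "pos": (6, 4), "temp": 4.0},
]
center = (5, 5)
biased_average = mean(s["temp"] for s in sensors)
central_average = mean(s["temp"] for s in within_range(sensors, center, 2))

observations = [
    {"time": 10, "value": 4.1},
    {"time": 30, "value": 4.3},
    {"time": 20, "value": 4.2},
]
latest_two = k_most_recent(observations, 2)
```

With these made-up values, averaging over all sensors is dominated by the corner cluster, whereas restricting the selection to sensors near the center yields the measure the manager is actually interested in.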


5 ILLUSTRATION EXAMPLE 


In this section, we illustrate how EQL-CE works. The aim here 
is to demonstrate the component syntax and provide some EBNF 
query examples. To do so, we consider the smart mall scenario of 
Section 2. For the sake of brevity, we do not define the entire con- 
nected environment (e.g., all locations, sensors in the mall). A fully 
detailed example (i.e., containing various query types and components) 
can be found at the following link: 
http://spider.sigappfr.org/research-projects/eql-ce/smart-mall/. 


5.1 Environment Modeling 


The mall is an infrastructure having a location map and various 
locations (e.g., shop 1, food court) that are tied by spatial relations. 
First, we need to define these components. Then, we INSERT in- 
stances. Syntax 1 defines an infrastructure as an entity that has a 
location map, and a set of embedded platforms (e.g., infrastructures, 
devices). A location map contains various locations (cf. Syntax 2). 
Finally, each location can be spatially tied to other locations (cf.
Syntax 3). 


Syntax 1: Defining the structure of an Infrastructure 


CREATE INFRASTRUCTURE ( <id> = <string> ,
[ LOCATION MAP <id> = <string> , ] [ { HOSTED PLATFORM <id> = <string> } ] ) ;


Syntax 2: Defining the structure of a Location Map 


CREATE LOCATION MAP ( <id> = <string> , [ { LOCATION <id> = <string> } ] ) ;


Syntax 3: Defining the structure of a Location 


CREATE LOCATION ( <id> = <string> ,
[ { RELATION TYPE <relation type> = 'directional'|'distance'|'topological',
RELATION NAME <relation name> = 'above'|'below'|'leftOf'|'opposite'|
'rightOf'|'closeTo'|'farFrom'|'contains'|'covers'|'crosses'|
'disjoint'|'equals'|'overlaps'|'touches',
OTHER LOCATION <id> = <string> } ] ) ;


In addition, one can rename, drop, or alter component definitions. 
We give an example for each of these queries in the following: 


Syntax 4: Renaming a component 


RENAME COMPONENT <id> = <string>, <new id> = <string> ;


Syntax 5: Dropping a component 


DROP COMPONENT <id> = <string> ;


Syntax 6: Altering the Location component (add a description field) 


ALTER LOCATION ( <id> = <string> ,
ADD [ DESCRIPTION <description> = <string> , ] ) ;


Once the components' definitions are established, we can start 
creating instances using INSERT queries. Queries 1, 2, and 3 instan- 
tiate an infrastructure, a location map, and location components 
respectively. We do not cover all locations found in Figure 1 to 
avoid redundancies. 


Query 1: Inserting an Infrastructure instance 


INSERT INFRASTRUCTURE HAVING ( <id> = 'Mall Infra',
LOCATION MAP <id> = 'Mall Map' ) ;


Query 2: Inserting a Location Map instance


INSERT LOCATION MAP HAVING ( <id> = 'Mall Map' ,
LOCATION <id> = 'Shop 1', 'Movie Theatre' ) ;


Query 3: Inserting two Location instances 


INSERT LOCATION HAVING ( <id> = 'Shop 1' ,
RELATION TYPE <relation type> = 'directional',
RELATION NAME <relation name> = 'leftOf',
OTHER LOCATION <id> = 'Movie Theatre' ) ;


INSERT LOCATION HAVING ( <id> = 'Movie Theatre' ,
RELATION TYPE <relation type> = 'directional',
RELATION NAME <relation name> = 'rightOf',
OTHER LOCATION <id> = 'Shop 1' ) ;
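
Query 3 states the relation between the two locations in both directions (leftOf and rightOf). As a rough sketch of how such mirrored relations could be checked for consistency, the Python snippet below reports inverse triples that have not been stated. The inverse pairs in `INVERSE` are our own assumption (the paper lists relation names but does not define their inverses), and the function names are hypothetical.

```python
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (source location, relation name, target location)

# Assumed inverse pairs for two directional relations; not defined in the paper.
INVERSE = {
    "leftOf": "rightOf",
    "rightOf": "leftOf",
    "above": "below",
    "below": "above",
}


def inverse_relation(triple: Triple) -> Optional[Triple]:
    """Return the mirrored triple, or None when no inverse is assumed."""
    source, relation, target = triple
    inverse = INVERSE.get(relation)
    return (target, inverse, source) if inverse else None


def missing_inverses(triples: Set[Triple]) -> Set[Triple]:
    """Return the mirrored triples that are implied but not stated."""
    missing = set()
    for triple in triples:
        mirrored = inverse_relation(triple)
        if mirrored is not None and mirrored not in triples:
            missing.add(mirrored)
    return missing


# The two relations inserted by Query 3.
stated = {
    ("Shop 1", "leftOf", "Movie Theatre"),
    ("Movie Theatre", "rightOf", "Shop 1"),
}
```

With both directions stated, as in Query 3, `missing_inverses(stated)` is empty.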


In addition, one can select, update, or delete instances of compo- 
nents. We give an example for each of these queries in the following: 


Query 4: Selecting all Locations from the Location Map 


SELECT LOCATION <id> FROM LOCATION MAP WHERE
LOCATION MAP <id> = 'Mall Map';


Query 5: Updating the location relation between Shop 1 and Movie Theatre 


UPDATE LOCATION CHANGE RELATION NAME <relation name> = 'leftOf'
INTO RELATION NAME <relation name> = 'opposite',
WHERE ( LOCATION <id> = 'Shop 1', OTHER LOCATION <id> = 'Movie Theatre');


Query 6: Deleting a Location 


DELETE LOCATION WHERE LOCATION <id> = 'Shop 1';


This concludes the definition and manipulation syntax/queries 
for the environment part. Next, we discuss sensor networks, events, 
and application domains. Due to space limitations, we focus mainly 
on the syntax to define the components' structures. 


5.2 Sensor Network Modeling 


The sensor network hosted in the mall comprises various static
and mobile sensors. They monitor the environment properties and 
produce observations. Some properties/observations are scalar (e.g., 
temperature) while others are multimedia (e.g., video surveillance). 
Therefore, we define here the following components: (i) Scalar 
Property; (ii) Media Property; (iii) Scalar Value; (iv) Media Value; 
and (v) Sensor. Syntax 7 details the structure of a scalar property 
which is mapped to a set of scalar observation values. Similarly, a 
media property (cf. Syntax 8) is mapped to a set of media values and 
a specific type (e.g., audio, video, image). Syntax 9 defines any scalar 
observation value produced by a sensor. Each observation has a 
timestamp, location, related sensor, a datatype, a value, and a unit. 
Media observation values are detailed in Syntax 10. Each media 
value is composed of a data object and a set of metadata/value 
pairs. Similarly to scalar values, each media value has a timestamp, 
location, and a related sensor. Finally, a sensor is defined as an entity
that has a type (e.g., static, mobile), a current location/coverage
area, and a set of previous locations/coverage areas/capabilities. Each
sensor is capable of sensing specific properties and can be hosted 
on a particular platform (a device or an infrastructure). Syntax 11 
describes the sensor component structure. 


Syntax 7: Creating a Scalar Property 


CREATE SCALAR PROPERTY ( <id> = <string> ,
[ { SCALAR VALUE <id> = <string> } ] ) ;


Syntax 8: Creating a Media Property 


CREATE MEDIA PROPERTY ( <id> = <string> ,
[ MEDIA TYPE <id> = 'audio' | 'image' | 'video' , ]
[ { MEDIA VALUE <id> = <string> } ] ) ;


Syntax 9: Creating a Scalar Value 


CREATE SCALAR VALUE ( <id> = <string> ,
[ DATATYPE <dt> = 'integer' | 'float' | 'boolean' | 'date' | 'time' |
'date time' | 'character' | 'string' , VALUE <val> = <empty> , ]
[ UNIT <id> = <string> , ] [ TIMESTAMP <val> = <empty> , ]
[ LOCATION <location id> = <empty> , ] [ SENSOR <sensor id> = <empty> ] ) ;


Syntax 10: Creating a Media Value


CREATE MEDIA VALUE ( <id> = <string> ,
[ DATA OBJECT TYPE <dot> = 'audio segment'|'visual segment' ,
DATA OBJECT <do> = <empty> , ]
[ { METADATA <meta> = 'text annotation descriptor'|'fundamental frequency'|
'harmonic descriptor'|'harmonic spectral centroid'|
'harmonic spectral deviation'|'harmonic spectral spread'|
'harmonic spectral variation'|'log attack time'|'power descriptor'|
'spectral centroid'|'spectrum basis'|'spectrum centroid'|
'spectrum envelop'|'spectrum flatness'|'spectrum projection'|
'spectrum spread'|'temporal centroid'|'waveform'|'camera motion descriptor'|
'motion activity descriptor'|'parametric motion descriptor'|
'trajectory descriptor'|'warping parameters'|'bounding box descriptor'|
'point descriptor'|'media duration descriptor'|'media time point descriptor'|
'color layout descriptor'|'color structure descriptor'|
'contour shape descriptor'|'dominant color descriptor'|
'edge histogram descriptor'|'face recognition descriptor'|
'scalable color descriptor' , VALUE <val> = <empty> } , ]
[ TIMESTAMP <val> = <empty> , ] [ LOCATION <location id> = <empty> , ]
[ SENSOR <sensor id> = <empty> ] ) ;


Syntax 11: Creating a Sensor 


CREATE SENSOR ( <id> = <string> ,
[ [ HAVING
[ SENSOR TYPE <sensor type> = 'static' | 'mobile' , ]
[ CURRENT LOCATION <id> = <string> , ]
[ { PREVIOUS LOCATION <id> = <string> , TIME INTERVAL <ti> = <empty> } , ]
[ CURRENT COVERAGE AREA <id> = <string> , ]
[ { PREVIOUS COVERAGE AREA <id> = <string> , TIME INTERVAL <ti> = <empty> } , ]
[ { CAPABILITY <id> = <string> , VALUE <val> = <string> } ] ] , ]
[ [ SENSING ( SCALAR PROPERTY <id> = <string> |
MEDIA PROPERTY <id> = <string> ) ] , ]
[ [ HOSTED ON PLATFORM <id> = <string> ] ] ) ;


5.3 Event Modeling 


Here we detail the event modeling in EQL-CE. We define the event 
as an n-dimensional space where each dimension mathematically
represents an event-describing feature. The latter are provided by
the application domain (cf. Figure 5). Moreover, event data is the 
set of sensor observations that help detect the event (i.e., event data 
belong to the event's n-dimensional space). Therefore, an event 
has a set of dimensions and event data. In addition, an event also 
has a set of sensors that provide the required observations for the 
detection. Finally, we added a type parameter to the event definition 
to distinguish elementary or atomic events (i.e., that require one 
observation from one sensor) from complex events (i.e., that require 
various observations from one sensor), and composite ones (i.e., that 
require various observations from different sensors). The following 
syntax defines event modeling components. 


Syntax 12: Creating an Event Structure 


CREATE EVENT ( <id> = <string> ,
[ EVENT TYPE <event type> = 'elementary' | 'complex' | 'composite' , ]
[ { SENSOR <sensor id> = <string> } , ]
[ { DIMENSION <dimension> = <string> } , ]
[ { EVENT DATA <data object> = SCALAR OBSERVATION <so> |
MEDIA OBSERVATION <mo> } ] ) ;
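
The distinction between the three event types can be restated as a small decision rule over the observations an event requires. The sketch below is only our reading of the definitions given in this section; the function name and the (sensor id, observation id) pair representation are assumptions made for illustration.

```python
from typing import List, Tuple


def classify_event(required: List[Tuple[str, str]]) -> str:
    """Classify an event from its required (sensor_id, observation_id) pairs.

    elementary: one observation from one sensor
    complex:    several observations from one sensor
    composite:  several observations from different sensors
    """
    if not required:
        raise ValueError("an event requires at least one observation")
    sensors = {sensor_id for sensor_id, _ in required}
    if len(sensors) > 1:
        return "composite"
    return "elementary" if len(required) == 1 else "complex"
```

For example, an event that needs two readings from sensor s9 would be classified as complex, whereas one that needs a reading from s9 and a frame from s1 would be composite.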


The following query defines a particular event instance, denoted
'Overheat in Shop 1', where the three main dimensions are time,
location, and temperature. This event relies on scalar temperature
observations that surpass the value 30. Once the event instance
is defined, any external event detection mechanism (e.g., eVM, cf.
Figure 3) can use this definition to detect occurrences of this event.


Query 7: Creating an Event Instance 


INSERT EVENT HAVING ( <id> = 'Overheat in Shop 1' ,
EVENT TYPE <event type> = 'elementary',
{ SENSOR <sensor id> = ANY },
{ DIMENSION <dimension> = 'Time', 'Location', 'Temperature' },
{ EVENT DATA <data object> = SCALAR OBSERVATION <so> } ),
WHERE ( <so>.<id> = 'Temperature',
<so>.<location_id> = 'Shop 1', <so>.<val> > 30 ) ) ;
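
To make the WHERE clause of Query 7 concrete, the following Python sketch expresses the same three conditions as a predicate over scalar observations. It is not how eVM or any other detection mechanism evaluates the definition; the `ScalarObservation` class and `matches_overheat` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalarObservation:
    """Hypothetical scalar observation mirroring the fields used in Query 7."""
    property_id: str   # corresponds to <so>.<id>
    location_id: str   # corresponds to <so>.<location_id>
    value: float       # corresponds to <so>.<val>
    sensor_id: str


def matches_overheat(so: ScalarObservation, threshold: float = 30.0) -> bool:
    """True when the observation satisfies all three conditions of Query 7."""
    return (
        so.property_id == "Temperature"
        and so.location_id == "Shop 1"
        and so.value > threshold
    )


# Illustrative observations (made-up values).
hot = ScalarObservation("Temperature", "Shop 1", 31.5, "Sensor 1")
mild = ScalarObservation("Temperature", "Shop 1", 24.0, "Sensor 1")
elsewhere = ScalarObservation("Temperature", "Food Court", 35.0, "Sensor 2")
```

Because the event accepts observations from ANY sensor, the predicate does not inspect `sensor_id`; restricting it to 'Sensor 1', as Query 8 does, would add one more condition.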


To keep up with the environment changes (cf. Criterion 4), one 
might need to re-write obsolete event definitions. Query re-writing
is provided automatically by the query optimizer (cf. Figure 3). How- 
ever, users can request an update at any time. This is illustrated in 
the following query where we update the event definition provided 
in Query 7 by only considering observations from Sensor 1. 


Query 8: Updating an Event Instance


UPDATE EVENT CHANGE (
SENSOR <sensor id> = 'Sensor 1',
WHERE (EVENT <id> = 'Overheat in Shop 1') ) ;


5.4 Application Domain Modeling 


As previously mentioned in the conceptual layer, application do- 
mains have different components, inter-component relations, and 
targeted events. Therefore, we provide here a generic definition of
application-domain-related components and relations (denoted
Concept and Relation respectively, cf. Syntax 13). We also provide a
definition for event-describing features (cf. Syntax 14) and their
datatypes (cf. Syntax 15) that can be instantiated in different do- 
mains. 


Syntax 13: Creating a Concept/Relation 


CREATE CONCEPT ( <id> = <string> , [ { ATTRIBUTE <id> } ] ) ;
ATTRIBUTE <id> = CONCEPT <id> | VARIABLE ( <label> = <string> ,
DATATYPE <datatype> = 'integer'|'float'|'boolean'|'date'|'time'|
'date time'|'character'|'string', VALUE <val> = <empty> ) ;


CREATE RELATION ( <id> = <string> ,
[ { CONCEPT <source id> = <string> } , ]
[ { CONCEPT <target id> = <string> } ] )


Every event feature has an identifier, a set of granularities (e.g.,
second, minute, hour for time), a function that converts one
granularity to another (e.g., 1 minute = 60 seconds), a boolean field
indicating if intervals can be created from this feature's values, and
a datatype.


Syntax 14: Creating an Event Feature 


CREATE EVENT FEATURE ( <id> = <string> ,
[ GRANULARITY SET { VALUE <val> = <string> } , ]
[ GRANULARITY FUNCTION <id> = <string> , ]
[ INTERVAL <boolean> = '0' | '1' , ]
[ EVENT FEATURE DATATYPE <event feature datatype id> = <string> ] ) ;
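
As an illustration of what a granularity function might look like for the time feature, the Python sketch below converts between the three granularities named above. The table and function name are assumptions for the example; the paper only states that such a function exists and gives 1 minute = 60 seconds as an instance.

```python
# Size of each time granularity expressed in seconds (assumed base unit).
SECONDS_PER_UNIT = {"second": 1, "minute": 60, "hour": 3600}


def convert_granularity(value: float, source: str, target: str) -> float:
    """Convert a time value from one granularity to another."""
    if source not in SECONDS_PER_UNIT or target not in SECONDS_PER_UNIT:
        raise KeyError("unknown granularity")
    return value * SECONDS_PER_UNIT[source] / SECONDS_PER_UNIT[target]
```

Going through a common base unit keeps the table linear in the number of granularities instead of requiring one rule per pair.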


An event feature datatype has an identifier, a primitive datatype,
a range of allowed values (i.e., lower bound min, upper bound max), 
and a function that measures the distance between values having 
the same event feature datatype. These details help translate event 
describing features (application domain) into dimensions of the 
event's n-dimensional space (event modeling) using the mediator 
(cf. Figure 5). 


Syntax 15: Creating an Event Feature Datatype


CREATE EVENT FEATURE DATATYPE ( <id> = <string> ,
[ DATATYPE <datatype> = 'integer'|'float'|'boolean'|'date'|'time'|
'date time'|'character'|'string', ]
[ RANGE ( MIN <min> = VALUE <val> , MAX <max> = VALUE <val> ) , ]
[ DISTANCE <function id> = <string> ] ) ;
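
The role of the range and the distance function in Syntax 15 can be illustrated with a short Python sketch. The class below, its field names, the Celsius bounds, and the absolute-difference distance are all assumptions chosen for the example; the paper does not prescribe a particular distance function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EventFeatureDatatype:
    """Hypothetical in-memory counterpart of an event feature datatype."""
    type_id: str
    minimum: float
    maximum: float
    distance: Callable[[float, float], float]

    def accepts(self, value: float) -> bool:
        """True when the value lies within the allowed [min, max] range."""
        return self.minimum <= value <= self.maximum


# Illustrative datatype: temperature in degrees Celsius (bounds are made up).
celsius = EventFeatureDatatype(
    type_id="temperature_celsius",
    minimum=-50.0,
    maximum=60.0,
    distance=lambda a, b: abs(a - b),
)
```

With such a definition, two values of the same feature can be compared along the corresponding dimension of the event space, e.g. `celsius.distance(30.0, 24.5)`.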


6 CONCLUSION & FUTURE WORK 


Many challenges emerge when proposing an EQL adapted to connected
environments. In this paper, we addressed the issues of
re-usability and coverage of various components/functionality. To do
so, we proposed EQL-CE: a three layered event query language for 
connected environments. We detailed its conceptual, logical, and 
physical layers. EQL-CE users compose EBNF queries that can
later be parsed into SQL, SPARQL, or other languages (re-usability).
Our proposal covers various connected environment components 
(environments, sensor networks, events, and application domains) 
and functionality (definition, manipulation, access control, event 
detection). We also proposed a query optimizer that allows query 
re-writing and the integration of spatial/temporal distribution func- 
tions. As future work, we would like to detail the security/privacy 
related queries and distribution/query re-writing functions. Also, 
we are currently developing an online simulator to allow users to 
run tests on a connected environment (e.g., the smart mall). Finally, 
we want to address additional challenges such as integrating batch 
queries and continuously processing data streams. 


ABSTRACT


Semantic web techniques (e.g., ontologies) have been recently adopted 
for sensor network modeling. However, existing works do not fully 
address these challenges: (i) representing different sensor types (e.g., 
mobile/static sensors) to enrich the network with different data and 
ensure better coverage; (ii) representing a variety of platforms (e.g., 
environments, devices) for sensor deployment, thus, integrating 
new components (e.g., mobile phones); (iii) representing the diverse 
data (scalar/multimedia) needed for various applications (e.g., event 
detection); and (iv) proposing a generic model to allow re-usability 
in various application domains. In this paper, we propose HSSN, 
an ontology that extends the Semantic Sensor Network (SSN) on- 
tology which is already re-usable and considers various platforms. 
We extend the representation of sensors, sensed data, and deploy- 
ment environments to cope with these challenges. We evaluate the 
consistency, accuracy, clarity, and performance of HSSN. 


CCS CONCEPTS 


• General and reference → General conference proceedings;
• Information systems → Ontologies; • Computer systems
organization → Sensor networks.


KEYWORDS 
Semantic Sensor Networks, Ontology, Sensor Mobility 


ACM Reference Format: 

Elio Mansour, Richard Chbeir, and Philippe Arnould. 2019. HSSN: An Ontol- 
ogy for Hybrid Semantic Sensor Networks. In 23rd International Database 
Engineering & Applications Symposium (IDEAS’19), June 10-12, 2019, Athens, 
Greece. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3331076. 
3331102 


1 INTRODUCTION 


Recently, Sensor Networks (SNs) have impacted more and more 
application domains [14] such as environmental sensing, military, 
and medical fields. Various sensors (e.g., camera, microphone) are 
nowadays embedded in smart phones, and capable of sensing use- 
ful data for various purposes (e.g., pollution monitoring in a city). 
Therefore, considering such devices and other equipment capable
of sensing is very beneficial for knowledge extraction in sensor
networks. Nonetheless, SNs may produce heterogeneous data, that 
have to be collected, processed, and analyzed in order to provide 
various services for network managers. Representing, sharing, and 
integrating the aforementioned data is a challenging task. In or- 
der to address this challenge, semantic web techniques, such as 
ontologies, have been adopted for their information representation. 
However, existing approaches on sensor network representation 
[1-4, 6, 11, 13] are restrictive due to the following issues: 


• Lack of platform diversity: existing approaches do not consider
equipment with embedded sensors (e.g., smart phones,
drones, machines) as platforms, in addition to traditional 
platforms (e.g., buildings, cities, offices) where sensors are 
deployed. Extending the platform representation, by both 
considering and detailing the representation of various types 
of platforms, allows the addition of new components to the 
network, nested platforms, and dynamic, collaborative sens- 
ing activities (e.g., crowd-sensing). 

• Lack of sensor diversity: these works do not represent different
sensor types (e.g., mobile/static sensors, simple sensor
nodes/multi-sensor devices, sensors capable of sensing
scalar/multimedia properties). Providing a more detailed 
sensor representation that considers various attributes (e.g., 
mobility) improves network coverage, and allows sensor 
tracking and dynamic sensing. 

• Lack of data diversity: most works cover only scalar environment
properties (i.e., they focus mainly on scalar data such as temperature
and motion, and neglect multimedia data such as sounds,
images, and videos). Since several devices are capable of sensing
both types, and data diversity is required for different
application purposes (e.g., event detection), it is important 
to cover scalar and multimedia data in the representation. 

• Lack of re-usability: these approaches are heavily linked to a
specific application domain. The sensor network modeling 
should remain generic and re-usable in different contexts. 


To answer these challenges, we present here an extension of the 
widely used Semantic Sensor Network ontology (SOSA/SSN) [7] 
called HSSN. It allows the representation of hybrid sensor net- 
works, i.e., networks containing mobile/static sensors, scalar/multi- 
media properties, and infrastructures/devices as platforms where 
sensors are deployed. We chose to extend SSN since it is already 
re-usable in various contexts and allows the representation of dif- 
ferent platforms. Nonetheless, sensor and data diversity are not 
fully developed. Our proposal adds diverse data, sensors, and details 
the description of various platform types. In addition, HSSN does 
not contain domain specific knowledge and can be easily aligned 
with other ontologies (e.g., mobile phone, smart building ontologies 
[16]).

The rest of the paper is organized as follows. Section 2 illustrates 
a scenario that motivates our proposal. Section 3 reviews related 
work regarding mobility, platforms, and sensed data. Section 4 de- 
tails the HSSN ontology. Section 5 describes the experimental setup 
and results. Finally, Section 6 concludes the paper and discusses
future research directions. 


2 MOTIVATING SCENARIO 


To highlight the utility of our proposal, we choose the following
scenario. We use this example only to illustrate concretely the needs,
challenges, and motivations behind our work; we do not consider
it a generic, all-summarizing sensor network application scenario.
Consider a smart mall/shopping center (cf. Fig. 1). In order to
optimize client comfort, health, and security, the smart mall relies 
on a set of sensors (s1-s9) to monitor the environment. Video surveillance
cameras (s1-s6) monitor security related events. Humidity,
CO2, and temperature sensors (s7, s8, and s9 respectively) make
observations that help regulate the indoor air quality and temperature.
The sensed data is stored and used for these applications.
However, many improvements still need to be integrated: 


Figure 1: Smart Mall Example


• Need 1- Provide better temperature/air quality readings: relying
on measures from a multitude of sensors (instead of
only one) allows a more precise monitoring of the environ- 
ment. Currently, this is not possible since there is only one 
temperature/air quality sensor in the mall. 

• Need 2- Keep track of client positions in the mall, since it is
useful to know the number of occupants in each zone and client
positions for tracking suspicious/interesting behaviours. Cameras
(s1-s6) are used by mall agents to monitor limited events
and cannot track client locations everywhere.

• Need 3- Cover all areas of the mall: this is critical for client
security and safety. In the current setup, many uncovered 
areas exist (e.g., no temperature monitoring in the movie 
theater, no video surveillance in Shop 2). 

• Need 4- Provide a rich documentation of critical events: in
order to increase the understanding of events (e.g., when
reporting incidents, providing evidence), rich descriptions
should be provided to police with a variety of sensed multi- 
media and scalar data (e.g., video, audio, image, temperature, 
humidity). Currently, reports on attack incidents (e.g., gun- 
shot) rely only on video surveillance footage (e.g., no noise 
levels to confirm the gunshot, no motion data to describe 
how people ran away). A bigger data variety is needed. 


• Need 5- Adapt to changing event detection needs: sometimes
new/spontaneous events need to be detected, the mall should 
be able to sense the required data and detect these events. 
However, the current sensor configuration/deployment and 
sensed data cannot be easily modified. This doesn't allow 
the detection of new events. 


In order to address these issues, the mall managers would need 
to add more sensors to cover all zones. This ensures full coverage 
of the mall (Need 3), and allows multiple observations from each 
zone for aggregation (Need 1). In addition, they could replace the 
cameras with more advanced ones that enable image processing for 
tracking purposes (Need 2). However, this increases the equipment, 
maintenance, and implementation costs without addressing Needs 
4 and 5. A more appropriate solution would be to integrate visitors' 
mobile phones (since they embed sensors) as mobile sensors in the 
mall's network, while avoiding excessive resource consumption 
from the devices (e.g., draining a phone's battery). This provides 
the following benefits: (i) sensor mobility provides observations 
from different areas of the mall, multiple sensors can therefore 
collaborate to calculate more reliable air quality/temperature mea- 
sures (Need 1); (ii) mall visitors can easily be tracked using their 
connected mobile phones (Need 2), location information can also 
be used to discover uncovered areas (Need 3); (iii) using various 
sensors from different devices helps cover a wider array of scalar/- 
multimedia properties (Need 4); and (iv) these devices provide a 
diversity of hardware (e.g., sensors), software, and services that can 
be adapted to changing event detection needs (Need 5). However, 
when adding mobility, diverse data, and devices to the network, the 
following challenges emerge: 


• Challenge 1: How to expressively describe locations in the
mall?

• Challenge 2: How to consider ad-hoc devices in the network?
How to query them based on their capabilities (e.g., without
draining their batteries)? How to represent the services that
they provide?

• Challenge 3: How to track locations and coverage areas of
mobile sensors?

• Challenge 4: How to collect scalar/multimedia observations
from sensors?


Other challenges also exist when modeling sensor networks. How- 
ever, we address here the aforementioned four challenges from a 
data modeling perspective by proposing an extension of the seman- 
tic sensor network ontology that includes mobility, platform, and 
data related concepts. 


3 RELATED WORK 


In this section, we study existing sensor network ontologies. We 
focus our review on sensor mobility, deployment platforms, and 
semantic representation of multimedia data. We compare these 
works based on the following criteria: 


(1) Sensor diversity: Indicating if different types of sensors exist 
in the sensor network (e.g., mobile/static sensors, simple 
nodes/multi-sensor equipment, sensors capable of sensing 
scalar/multimedia properties). 

(2) Platform diversity: Stating if the approach allows and details
the description of different platforms where sensors are
deployed (e.g., in infrastructures, on devices). 

(3) Data diversity: Denoting the approach's ability to handle 
various data/properties (e.g., scalar, multimedia). 

(4) Re-usability: Indicating if the approach is re-usable in various 
contexts. 


3.1 Sensor Diversity 


In [2], the authors focus mainly on features that describe the sensor 
nodes, their functionality, and their current CPU, memory, and 
power supply states (in order to determine the future state of the 
WSN). However, they do not represent different types of sensors. 
In [6], the authors provide a set of ontologies describing missions, 
tasks, sensors, and deployment platforms for sensor to task assign- 
ment. Unfortunately, different types of sensors were not considered. 
In [7], the authors propose the SOSA/SSN¹ ontologies. Together,
they describe systems of sensors and actuators, observations, the 
used procedures, properties, and so forth. SOSA/SSN propose sim- 
ple sensor node representation, as well as (sensing) systems/devices. 
However, SOSA/SSN do not propose any mobility-related concepts, 
nor multimedia data/properties. The authors only consider one 
aspect of sensor diversity (i.e., simple sensor nodes/sensor systems). 
In [1], the authors propose an extension of SSN, denoted MSSN 
(Multimedia SSN), where they detail the technical aspects of multi- 
media data (e.g., video, audio segments, frequencies). In this work, 
the authors improve the sensor diversity of SOSA/SSN by adding a 
media sensor (i.e., a sensor type that observes multimedia proper- 
ties). However, they do not achieve full sensor diversity as they do 
not consider sensor mobility (i.e., mobile/static sensors). 


3.2 Platform Diversity 


The authors in [9] only consider embedded sensors on mobile 
phones to monitor noise pollution. In [4], the authors rely on tradi- 
tional deployment of sensor nodes in the wilderness to detect fire 
events. However, these works do not provide any platform diversity.
In the SSN ontology [7], sensors are deployed on platforms.
SSN also introduces systems, that can integrate various sensors, 
actuators, and samplers. Therefore, SSN provides a foundation for 
sensor deployment on various platforms (e.g., traditional deploy- 
ment on platforms, embedding sensors in systems and devices). 
However, the differences between these platforms are not detailed
in SSN. The description of physical infrastructures/environments
such as smart buildings and cities (where it would be interesting to
model maps and locations) differs from that of machines, drones,
and devices that host sensors (where it would be interesting to 
model hardware and software). It is better to distinguish and detail 
the description of different platform types to better understand the 
environments where sensors are deployed (e.g., for location-based 
services in infrastructures, task assignment based on hardware/- 
software capabilities for devices). MSSN [1] suffers from the same 
limitation since it is based on the SSN ontology and does not add 
any new concepts related to platforms. 


¹ https://www.w3.org/TR/vocab-ssn/


3.3 Data Diversity 


In [5], the authors represent images for object recognition pur- 
poses. The scope of their work does not extend to other types of 
multimedia data (e.g., video, audio). In [10], the authors are also 
limited to image representation, since they propose an approach 
for object-based image retrieval. In [11], the authors monitor noise 
pollution in urban zones by sensing (audio) noise levels using occu- 
pants' mobile phones. The authors only consider noise data, and 
geo-locations in order to generate a noise level map. Therefore, their 
proposal does not fully consider data diversity (e.g., video, images, 
other scalar data). The SSN ontology [7] does not consider multime- 
dia observations. It details scalar sensed data. This motivated the 
proposal of MSSN [1] where the authors represent multimedia data 
in sensor networks. For each multimedia observation value, the 
authors associate data descriptors (denoted media descriptors), and 
data segments (denoted media segments). Their proposed ontology, 
MSSN, complements the SSN ontology [7] since the latter does not 
cover multimedia contents nor multimedia sensors. 


3.4 Re-usability 


In [9], the authors propose a noise pollution monitoring solution in 
a city using mobile phones to sense noise. The authors enrich the 
sensed information by allowing users to add contextual information 
to their sensor observations. However, it lacks the genericity needed 
for it to be reusable in other contexts. In [1], the authors propose a 
multimedia wireless sensor network ontology for event detection 
purposes (the authors include concepts related to atomic, complex 
events, and event detection/composition). These added concepts are 
domain specific and not necessary in other application scenarios. 
This restricts MSSN's re-usability. Each of these works is task-centric
and heavily linked to an application purpose. The SSN
ontology [7] remains generic and re-usable in various contexts 
since it is extensible and does not contain any concepts that link it 
to any specific application. 


3.5 Discussion 


The aforementioned works do not fully integrate sensor diversity 
in their representation of sensor networks (i.e., static/mobile sen- 
sors, simple node/multi-sensor devices, and scalar/multimedia sen- 
sors). The SSN ontology [7] is a culmination of much of the related 
work on semantic sensor networks and is the most widely used (re- 
usable). In addition, SSN is extensible, facilitates alignments with 
other standards, and allows the integration of new concepts. The 
MSSN ontology [1], integrates multimedia data in SSN. Therefore, 
we propose to extend SSN since: (i) it partially allows sensor diver- 
sity; (ii) it is re-usable and does not contain any domain specific 
knowledge; and (iii) it allows having various platform types. More- 
over, we do not neglect MSSN for its ability to cover multimedia 
data (data diversity). Therefore, our proposal will extend SSN and 
use key MSSN concepts in order to achieve full sensor diversity, 
platform diversity enriched with detailed descriptions of each type 
(e.g., infrastructures, devices), and finally data diversity through 
the coverage of scalar/multimedia sensed data. 


4 HSSN ONTOLOGY


In this section, we detail our proposed extension of the SSN on- 
tology, and mainly our additions related to: (i) sensor diversity; 
(ii) platform diversity; and (iii) data diversity. The following pre- 
fixes sosa:, ssn:, mssn:, time:, and hssn: refer to the SOSA[7], SSN[7], 
MSSN[1], Time[8], and HSSN ontologies respectively. We begin 
first by describing sensor-related concepts. 


4.1 Sensor Diversity 


4.1.1 Sensor Mobility. Fig.2 illustrates the sensor types added in 
HSSN. The concept Sensor already exists in the SSN ontology, where 
mobility is not extensively developed. Therefore, we add two child 
concepts of Sensor: (i) MobileSensor, describing any sensor that 
has the ability to move or change location; and (ii) StaticSensor, a 
sensor that does not change location in time. This allows the sensor 
network to have diverse sensor types (cf. Criterion 1 - Section 3). 


4.1.2 Sensor Tracking. Every sensor has a Location. To consider
mobility, one should be able to locate any sensor at all times. The ob- 
ject property isCurrentlyLocatedAt maps each sensor to its current 
Location (cf. Challenge 3 in Section 2). This is specifically impor- 
tant for tracking mobile sensors, since static sensors do not change 
locations (cf. Fig.3). A hasPastLocation property is added to retrieve 
the previous positions of a (mobile) Sensor, and also a hasLocation- 
Time (cf. Fig.4) property is added to map these positions to time 
instants or intervals in order to track sensors (temporal entities are 
extracted from Time ontology [8]). 
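
The tracking properties above can be illustrated with a minimal Python sketch. It is not part of the HSSN implementation: the class, the method names, and the integer time stamps are assumptions made for illustration only. The attribute current_location plays the role of isCurrentlyLocatedAt, and each entry of past_locations pairs a previous Location (hasPastLocation) with a time interval (hasLocationTime).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Sensor:
    """Hypothetical stand-in for an HSSN Sensor individual."""
    name: str
    mobile: bool
    current_location: Optional[str] = None  # isCurrentlyLocatedAt
    located_since: Optional[int] = None
    # hasPastLocation + hasLocationTime: (location, start, end) intervals
    past_locations: List[Tuple[str, int, int]] = field(default_factory=list)

    def move_to(self, location: str, now: int) -> None:
        """Set the current location, archiving the previous one with its interval."""
        if self.current_location is not None:
            if not self.mobile:
                raise ValueError("a static sensor does not change location")
            self.past_locations.append(
                (self.current_location, self.located_since, now)
            )
        self.current_location = location
        self.located_since = now

    def location_at(self, t: int) -> Optional[str]:
        """Return where the sensor was at time t, or None if unknown."""
        for location, start, end in self.past_locations:
            if start <= t < end:
                return location
        if self.located_since is not None and t >= self.located_since:
            return self.current_location
        return None


phone = Sensor("phone-gps", mobile=True)
phone.move_to("shop-1", now=0)
phone.move_to("corridor", now=10)
```

In this sketch a static sensor is located once and any later move is rejected, which mirrors the remark that static sensors do not change locations.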
Figure 2: HSSN Sensor View
Figure 3: Sensor/Location Mapping
Figure 4: Previous Location/Time Mapping


4.1.3 Coverage Area. Each Sensor, mobile or static, has a Cover- 
ageArea (cf. Fig.5), a geographical zone that contains any sensing 
activity (i.e., any happening outside of this zone is not detected by 
the Sensor). In order to represent coverage areas, we consider the 
following: (i) a CoverageArea is bound to the sensor's current Loca- 
tion; and (ii) the geographical spread of a CoverageArea is affected by 
the sensing range and sensing angles (horizontal and vertical orien- 
tation) of the concerned Sensor. We represent the coverage area as 
a sector of space (Fig.6 shows a horizontal slice of the space) where 
S is the focal point (the sensor's current Location), α, β ∈ [0; 2π]
are the angles that define the horizontal/vertical rotational spread
of the coverage area respectively, and the distance SA = SB is the
sensing range that defines the extent of the coverage area. The
angles and range depend on the sensor's capability properties. For
instance, a temperature sensor has α = β = 2π, but a surveillance
camera has α = π/4, β = π/6 if the camera lens is limited to a 45°
horizontal angle, and a 30° vertical angle. Similarly, the sensing
range varies from one sensor to another (e.g., 10, 20, 50 meters).
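
The sector model can be made concrete with a short Python sketch that tests whether a point falls inside a coverage area. Several choices here are assumptions that the text does not specify: the sector is taken to be symmetric around an orientation axis (given by the hypothetical heading and tilt fields), coordinates are Cartesian, and angles are in radians.

```python
import math
from dataclasses import dataclass


def _wrap(angle: float) -> float:
    """Wrap an angle to the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


@dataclass
class CoverageArea:
    x: float                 # focal point S (the sensor's current Location)
    y: float
    z: float
    sensing_range: float     # distance SA = SB
    horizontal_angle: float  # horizontal spread, in [0, 2*pi]
    vertical_angle: float    # vertical spread, in [0, 2*pi]
    heading: float = 0.0     # assumed azimuth of the sector axis
    tilt: float = 0.0        # assumed elevation of the sector axis

    def covers(self, px: float, py: float, pz: float) -> bool:
        """True if the point lies within the range and both angular spreads."""
        dx, dy, dz = px - self.x, py - self.y, pz - self.z
        dist = math.sqrt(dx * dx + dy * dy + dz * dz)
        if dist > self.sensing_range:
            return False
        if dist == 0.0:
            return True
        azimuth = math.atan2(dy, dx)
        elevation = math.asin(max(-1.0, min(1.0, dz / dist)))
        h_offset = abs(_wrap(azimuth - self.heading))
        v_offset = abs(elevation - self.tilt)
        return (h_offset <= self.horizontal_angle / 2
                and v_offset <= self.vertical_angle / 2)


# Temperature sensor: full spread in both directions, 10 m range.
thermometer = CoverageArea(0, 0, 0, 10, 2 * math.pi, 2 * math.pi)
# Surveillance camera: 45 degree horizontal, 30 degree vertical, 20 m range.
camera = CoverageArea(0, 0, 0, 20, math.pi / 4, math.pi / 6)
```

With full angular spreads the test reduces to a distance check, which matches the temperature sensor example; the camera only covers points close to its axis.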
Figure 5: Coverage Area
Figure 6: Coverage Area - Horizontal Spread


The composition of a CoverageArea is explained in Fig.7. The 
SensingLocation is equivalent to the sensor's Location, and the an- 
gles and range of the CoverageArea are equivalent to the sensor's 
HorizontalAngle, VerticalAngle, and Range properties that we added 
in HSSN as part of a system's properties. Since static sensors are 
immobile, it is easy to know their coverage areas using the sensor's 
location, and its sensing range and angles. In contrast, knowing 
the coverage areas of mobile sensors is more challenging, since 
these areas move when the sensors move. In order to keep track 
of these changes, the object property currentlyCovers maps each 
Sensor to its current CoverageArea (cf. Fig.8). Also, the property 
hasPastCoverageArea maps mobile sensors to their respective sets 
of previous coverage areas (cf. Challenge 3 in Section 2). Finally, 
hasCoverageTime is the property that maps previous coverage ar- 
eas to temporal entities (i.e., time instant or interval from Time 
ontology [8]) for tracking purposes (cf. Fig.9). 
Figure 7: Coverage Area Composition
Figure 8: Sensor/Coverage Area Mapping
Figure 9: Coverage Area/Time Mapping


4.2 Platform Diversity


4.2.1 Infrastructure Representation. In SSN [7], sensors are deployed
on platforms. In Fig.10, we define the following child concepts of 
Platform: (i) Infrastructure, a physical environment having locations 
where sensors could be deployed (cf. Challenge 1 in Section 2); 
and (ii) Device, a piece of electronic equipment where sensors could be
embedded (cf. Challenge 2 in Section 2). This allows different types 
of deployments such as the traditional deployment in environments 
(e.g., buildings, malls) or nested deployment of multi-purpose de- 
vices that in turn embed sensors (e.g., mobile phones). This provides 
platform diversity (cf. Criterion 2 in Section 3). Every Infrastructure de-
scribes a specific physical environment where sensors are deployed.
Therefore, infrastructures can host platforms such as other infras- 
tructures (e.g., cities host buildings) and devices (e.g., buildings host 
mobile phones). However, devices can embed systems of sensors, 
actuators, and samplers but cannot host infrastructures (e.g., build-
ings). Each Infrastructure is described by a LocationMap which
contains (isComposedOf property) a set of Locations (cf. Fig.11).
For example, a building is an Infrastructure that has a Location-
Map. The latter describes the spatial relations between individual
Locations in the building such as floors, offices, etc. HSSN uses topo-
logical, distance, and directional relations to describe the spatial
ties that exist between individual Locations. We integrate the afore-
mentioned location-related concepts in order to locate sensors, and
better understand the spatial constraints/setup of the Infrastructure.
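
The hosting rules described above (infrastructures host other infrastructures and devices, while devices only embed systems) can be sketched in Python as follows. The class layout, the method names, and the example individuals are illustrative assumptions, not the ontology's actual axioms.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Platform:
    name: str


@dataclass
class Infrastructure(Platform):
    # Locations listed by the infrastructure's LocationMap (isComposedOf).
    locations: List[str] = field(default_factory=list)
    hosted: List[Platform] = field(default_factory=list)

    def host(self, platform: Platform) -> None:
        """An infrastructure may host other infrastructures and devices."""
        self.hosted.append(platform)


@dataclass
class Device(Platform):
    # A device embeds systems (sensors, actuators, samplers) but has no
    # host() method, so it cannot host an infrastructure.
    embedded_systems: List[str] = field(default_factory=list)

    def embed(self, system: str) -> None:
        self.embedded_systems.append(system)


def all_hosted(infrastructure: Infrastructure) -> List[str]:
    """List the names of all platforms hosted directly or transitively."""
    names = []
    for platform in infrastructure.hosted:
        names.append(platform.name)
        if isinstance(platform, Infrastructure):
            names.extend(all_hosted(platform))
    return names


city = Infrastructure("city")
building = Infrastructure("building", locations=["floor-1", "office-12"])
phone = Device("phone")
phone.embed("microphone")
city.host(building)
building.host(phone)
```

Leaving host() off Device is one simple way to encode the restriction; an ontology would express it through domain and range constraints instead.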
Figure 10: Platform Representation
Figure 11: Infrastructures


4.2.2 Device Representation. A Device is another type of Platform
where sensors are deployed. It is introduced in HSSN to repre- 
sent mobile phones and other sensing equipment. A Device has 
sub-concepts for storage, communication, processing, and power 
supply, in addition to the ability of embedding sensors (using the de- 
ployEntity concept cf. Fig.12). These concepts describe the Hardware 
of a Device. The Software part is also represented. A Device could 
be used for various purposes (e.g., representing mobile phones for 
mobile phone sensing, machines with mounted sensors for fault 
detection in an Industry 4.0 scenario). The hardware and software 
representation allows complex queries such as assigning sensing 
tasks to devices based on their processing capabilities, or battery 
status (cf. Challenge 2 in Section 2). Finally, each Device can provide 
a set of services. Fig.13 illustrates our service modeling, inspired 
by the Web Service Modeling Ontology (WSMO) [12]. We created 
generic concepts that can be aligned with WSMO. We intentionally keep
the service description coarse-grained to allow alignments with any other
service ontology. We limit the service modeling to the following 
concepts: Service Metadata describes the properties of a Service. 
The Input represents the set of variables and constraints required 
for correct service execution, while the Output is the set of gen- 
erated results. The functionality of a service is described by the 
Capability concept which is mapped to a specific UserGoal or objec- 
tive (i.e., a user desire satisfied by the service). Users communicate 
with a service through UserInteractionInterfaces (choreography in 
WSMO). Services communicate with each other via the
ServiceInteractionInterface (service orchestration in WSMO). Finally,
the detailed infrastructure and device descriptions also improve sensor di-
versity by allowing the representation of simple sensor nodes in
infrastructures, multi-sensor systems, and multi-sensor devices. 
Figure 12: Device Components


[Diagram labels: hssn:Input, hssn:Interface, hssn:UserGoal, hssn:Metadata; properties hasVariables, hasCapability, hasInterfaces, isRequestedBy, hasMetadata]

Figure 13: Service Components
4.3 Data Diversity


Audio, image, and video data can be sensed by mobile or static sen- 
sors (e.g., surveillance cameras, mobile phones). Also, in order to 
detect complex events (e.g., gunshot) a combination of multimedia 
and scalar observations is needed. Therefore, we aim to integrate 
concepts related to multimedia properties (cf. Criterion 3 in Section 
3). In MSSN [1], multimedia data/properties are integrated in SSN. 
We re-organize MSSN multimedia concepts into scalar (e.g., tem- 
perature, motion) and multimedia (e.g., noise, video) properties as 
illustrated in Fig.14. Also, we introduce in Fig.15 the mediaSenses 
and scalarSenses relationships to map sensors to their correspond- 
ing scalar and/or multimedia observable properties (cf. Challenge 
4 in Section 2). This highlights the sensor diversity in HSSN since 
static/mobile sensors can detect scalar and/or multimedia proper- 
ties. The authors in [1] also describe technical aspects/metadata of 
multimedia objects such as annotations, audio (e.g., frequencies), 
motion (e.g., trajectories), visual (e.g., color histograms). We use 
these concepts in HSSN to describe sensor observation values. A 
MediaValue in HSSN is composed of the MultimediaData concept, 
referring to the audio, video, or image objects/files and the Medi- 
aDescriptor concepts, describing the metadata of the multimedia 
objects (e.g., frequencies, colors). ScalarValues are textual (e.g., tem- 
peratures, humidity levels). Finally, we map observation values to 
their related properties using the hasMediaValue and hasScalarValue 
relationships. Sensors can now be correctly mapped to observable 
properties and observation values (cf. Challenge 4 in Section 2). 
Figure 14: Observable Properties
Figure 15: Sensors/Properties


In conclusion, new concepts and properties are introduced in 
HSSN in order to address the challenges presented in Section 2. 
Our proposal details the representation of infrastructures (a type
of platform) by adding location maps, individual locations, and
spatial relations. This allows locations to be described expressively (cf.
Challenge 1). In HSSN we describe devices as platforms that host
sensors. We detail device hardware, software, and provided services. 
In addition, we add properties that help locate, track, and query 
these devices (cf. Challenge 2). HSSN also provides a description of 
sensor coverage areas and properties that map both locations and 
coverage areas to mobile/static sensors at any time (cf. Challenge 
3). Finally, we address data heterogeneity by detailing multimedia 
data objects, their metadata, and scalar data. We also map them to 
their respective sensors (cf. Challenge 4). 


5 IMPLEMENTATION AND EXPERIMENTAL 
SETUP 


5.1 HSSN Implementation


We implemented the HSSN ontology using Protege 5.2.0². The
files are available at http://spider.sigappfr.org/research-projects/
hybrid-ssn-ontology/ (External Links - Download ontology files).
Also, complete documentation can be found at
http://spider.sigappfr.org/HSSNdoc/index-en.html. In the following, we detail the SPARQL
queries used during the experimentation. Then, we describe the
experimental setup, before discussing the obtained results from an
accuracy, clarity, performance, and consistency standpoint.


5.2 Illustration Example 


The challenges mentioned in Section 2 can be addressed via the 
following SPARQL queries.

Platform Diversity: In order to expres-
sively describe locations (Challenge 1) in the mall infrastructure, a
detailed representation of location maps and locations is needed 
(Query 1). Also, covered and uncovered areas should be easily found 
(Query 2). In order to consider ad-hoc devices in the network (Chal- 
lenge 2), one should be able to query devices, their hardware (e.g., 
embedded sensors), software, and services. Query 3 shows how to 
locate a mobile device by querying its embedded sensor. Similarly, 
one could query a device based on other characteristics (e.g., battery 
status, processing power). 


Query 1: Knowing the spatial description of infrastructures 


SELECT distinct ?infrastructure ?locationmap ?location WHERE
{?infrastructure isDescribedBy ?locationmap. ?locationmap isComposedOf ?location.}


² https://protege.stanford.edu/
Query 2: Knowing covered locations


SELECT distinct ?location ?coveragearea WHERE {?location isIncludedIn ?coveragearea.}


Query 3: Locating mobile devices, querying device hardware 


SELECT distinct ?location ?dev WHERE {?location currentlyLocates ?sensor. 


?sensor isEmbeddedOn ?du. ?du hasExpansionCard ?hd. ?hd isRelatedToDevice ?dev.} 


Sensor Diversity: To track sensors at all times (Challenge 3), 
it is important to know current locations/coverage areas for all 
sensors (Query 4), as well as previous ones (Query 5). 


Query 4: Finding current sensor locations/coverage areas 


SELECT distinct ?location ?sensor ?coveragearea WHERE 


{?location currentlyLocates ?sensor. ?sensor currentlyCovers ?coveragearea.} 


Query 5: Finding previous sensor locations 


SELECT distinct ?location ?sensor WHERE {?location hasPreviouslyLocated ?sensor} 


Data Diversity: In order to consider data diversity (Challenge 
4), one should be able to distinguish scalar/multimedia data and
correctly map them to sensors (Queries 6 and 7). 


Query 6: Mapping sensors to their scalar properties and observations 


SELECT distinct ?sensor ?property ?observation WHERE 


{?sensor scalarSenses ?property. ?property isScalarValueOf ?observation.} 


Query 7: Mapping sensors to their multimedia properties and observations 


SELECT distinct ?sensor ?property ?observation WHERE 


{?sensor mediaSenses ?property. ?property isMediaValueOf ?observation.} 
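
To illustrate how the conjunctive patterns in these queries are evaluated, the following Python sketch joins triple patterns over a small in-memory list of triples. It is a toy matcher written for illustration, not a SPARQL engine, and the individuals (shop-1, cam-1, and so on) are hypothetical; only the property names come from Queries 4 and 5.

```python
def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = {} if binding is None else binding
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from match(rest, triples, candidate)


def select_distinct(variables, patterns, triples):
    """Project the bindings onto the variables and drop duplicate rows."""
    rows = {tuple(b[v] for v in variables) for b in match(patterns, triples)}
    return sorted(rows)


# Hypothetical individuals; property names follow Queries 4 and 5.
TRIPLES = [
    ("shop-1", "currentlyLocates", "cam-1"),
    ("cam-1", "currentlyCovers", "area-1"),
    ("corridor", "currentlyLocates", "phone-mic"),
    ("phone-mic", "currentlyCovers", "area-2"),
    ("shop-1", "hasPreviouslyLocated", "phone-mic"),
    ("shop-2", "currentlyLocates", "thermo-1"),
]

QUERY_4 = [
    ("?location", "currentlyLocates", "?sensor"),
    ("?sensor", "currentlyCovers", "?coveragearea"),
]
QUERY_5 = [("?location", "hasPreviouslyLocated", "?sensor")]

rows_4 = select_distinct(["?location", "?sensor", "?coveragearea"], QUERY_4, TRIPLES)
rows_5 = select_distinct(["?location", "?sensor"], QUERY_5, TRIPLES)
```

The sensor thermo-1 has a location but no coverage area in this toy data set, so it does not appear in the result of Query 4, just as a sensor lacking a currentlyCovers assertion would be absent from the real query result.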


5.3 HSSN Experimental Setup 


Here, we did not aim to evaluate the existing SSN concepts and properties.
Instead, we evaluated the impact of our newly added concepts (e.g., stat-
ic/mobile sensors, infrastructures/devices, multimedia/scalar data).
Our objectives were the following: 


(1) Accuracy Evaluation: Checks if the added concepts/proper-
ties answer the aforementioned challenges. This query-based
evaluation highlights the impact of our extensions in over-
coming the challenges mentioned in Section 2.

(2) Clarity Evaluation: Checks if the labels used to describe the
concepts/properties are clear and unambiguous to domain
stakeholders. The aim is to evaluate the compatibility and
clarity of our provided description with respect to the appli-
cation domain.

(3) Performance Evaluation: Measures the impact of HSSN ad-
ditions on performance (i.e., query run time). The aim is
to evaluate the feasibility, performance-wise, of integrating
HSSN in sensor network applications.

(4) Consistency Evaluation: Checks if the added concepts/prop-
erties generate inconsistencies (e.g., anti-patterns) within
the structure of the ontology. The aim is to evaluate the
soundness of the ontology graph.


5.3.1 Accuracy Evaluation. We created a population of individuals 
and ran the aforementioned queries. Then, we compared the ob- 
tained and expected results. We created two infrastructures, each 
described by a location map containing 500 locations. Then, 1000 
sensors were deployed (500 mobile/static, 500 scalar/media). Each 
sensor is located in one location, covers one coverage area, observes 
one property, and produces one observation value. 

Platform Results: We ran queries 1, 2, and 3. The returned results
perfectly match the expected ones. Infrastructures were correctly
assigned to their location maps and included locations. This al-
lowed the identification of distinct spaces/areas. Query 2 correctly
returned the set of distinct locations included in each coverage area.
This allowed the identification of uncovered locations. Query 3
allowed the identification of device hardware related to the embed-
ded sensors. Also, the mobile devices were correctly located in the
location map.

Mobility Results: We ran queries 4 and 5 on the population of 
individuals and for each case the returned results matched exactly 
the expected ones. Sensors were correctly assigned to their curren- 
t/previous locations and coverage areas. 

Data Results: We ran queries 6 and 7 and obtained an exact match
between the actual and expected results. Thus, scalar/multi-
media properties were correctly distinguished. Also, sensors were
correctly assigned to the scalar or multimedia observations that 
they produced. 

Result Discussion: The test results showed that locating any type 
of sensor (i.e., simple node/multi-sensor device, static/mobile sen- 
sors, and scalar/multimedia sensors), and knowing their coverage 
areas is possible at any point in time. This enables tasks such
as tracking mobile sensors and detecting uncovered areas. Also,
the results showed that the detailing of infrastructure and device 
descriptions (platform diversity) allowed a better knowledge of 
the environment space (also important for locating sensors). Multi- 
sensor devices were also detailed by describing their hardware 
and software which proved useful when querying devices based 
on their capabilities (e.g., we ran an additional query that returns 
sensors/devices with good battery status). From a data diversity 
standpoint, the results showed that sensors that sense multimedia/s- 
calar properties were correctly distinguished and their observations 
were accurately retrieved. To conclude, the query results confirmed 
that the added extensions (i.e., regarding sensor, platform, and data 
diversity) accurately answer the challenges mentioned in Section 2. 


5.3.2 Clarity Evaluation. We created two evaluation forms: the
first³ for evaluating the ambiguity of the labels used to describe
the HSSN concepts, and the second⁴ for evaluating the ambigu-
ity of the labels used to describe inter-concept relations. We sent


³ Link: https://goo.gl/forms/blcSpKLLqtNtjXHI2
⁴ Link: https://goo.gl/forms/KNNY3XsmGp0ptM2N2
the two forms to 50 sensor network and ontology experts (25 net-
working experts, and 25 computer scientists). Results in Fig.16 and
17 show that terms considered clear by computer scientists are 
sometimes found ambiguous by network experts and vice-versa. 
Fig.16 shows that a few terms do not meet the acceptable ambiguity 
level (e.g., ComUnit, DeployUnit), while others (e.g., MediaProperty, 
MediaValue) need some clarification. Therefore, we considered the 
experts’ suggestions in the final version of the ontology by modi- 
fying the following: (i) ExpansionCard instead of DeployUnit; (ii) 
PowerSupply instead of PowerUnit; (iii) NetworkInterface instead of 
ComUnit; (iv) Memory instead of StorageUnit; (v) Processor instead 
of ProcessingUnit; and (vi) Multimedia instead of Media. Finally,
Fig.17 shows that in most cases, both categories of experts correctly
assigned the inter-concept relationships. Networking experts had a
low success rate on the first two questions, since these concern inheritance
between concepts, which is outside their domain of expertise.
Figure 17: Property Evaluation


Result Discussion: The clarity evaluation allowed the identi-
fication and correction of ambiguous/unclear labels that we used
to describe our added concepts/properties. In the version currently
available online, all labels achieve an acceptable level of clarity
(based on the stakeholders' feedback). This reinforces the re-usability
of HSSN since it is unambiguous and easily understood.


5.3.3 Performance Evaluation. In order to evaluate the performance 
of HSSN, we measured the query run-time by running each of the 
previously mentioned queries 10 times and calculating the average. 
We varied the size of the population (100 sensors, 1000 sensors, and 
10000 sensors) in order to test various scenarios related to mobility, 
platforms, and data. 
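
The measurement protocol (ten runs per query, then the average) can be sketched in Python as below. The function name and the run_query callable are placeholders introduced for illustration; they stand for whatever executes one SPARQL query against the populated ontology.

```python
import time
from statistics import mean


def average_runtime(run_query, repetitions=10):
    """Run a query callable several times and return the mean duration in seconds."""
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return mean(durations)


# Placeholder workload standing in for one query execution.
calls = []
elapsed = average_runtime(lambda: calls.append(sum(range(1000))))
```

Averaging over repeated runs smooths out one-off effects such as caching or scheduler noise, which is presumably why each query was executed several times.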

Mobility impact: In this test, we varied the percentage of mobile 
sensors in the network (0, 30, 50, 70, and 100 %). Then, we retrieved 
the current/previous sensor locations (cf. Fig.18 and 19). We mea-
sured the run-time for queries 4 and 5. In Fig.18, we noticed that
increasing the number of mobile devices increases the time required
to retrieve current sensor locations. This is due to the fact that lo- 
cating a device (Query 3) was a more complex task than locating a 
static sensor since we needed to locate the sensor, its deployment 
unit, hardware, and then the device. We noticed the same pattern 
for all three cases (100, 1000, 10000 sensors). Finally, the progression 
from 0% to 100% mobile devices had a quasi-linear impact on query 
run-time. Similarly, Fig.19 details the query run-time for retrieving 
the different previous sensor locations. Since mobile sensors have a
larger list of previous locations in comparison with static sensors, 
increasing the mobility percentage (0, 50, 100 %) increases the query 
run-time. This progression was also quasi-linear for all three cases 
(100, 1000, 10000 sensors). 
Figure 18: Mobility impact on current location retrieval
Figure 19: Mobility impact on previous location retrieval


Platform impact: In this test, we varied the sensor distribution 
on the platform locations. We tested three different scenarios: (i)
each sensor is located in one location; (ii) all sensors are located in 
one location; and (iii) half of the sensors are located in a location 
and the other half in another. We measured the run-time of the 
query that retrieves sensor locations. 
Figure 20: Platform impact on current location retrieval


Fig.20 shows how sensor distribution on locations affected the 
time needed to map sensors to their current locations. When all 
sensors were located in one location, the required time to perform 
this task was minimal. Then, as we began to decrease sensor densi- 
ties, the query took more time. Finally, the worst case was when 
every location contained only one sensor.

Data impact: Here, we checked the impact of scalar/multimedia
data on the run-time of queries 6 and 7 (cf. Fig.21). 
Figure 21: Data impact on observation retrieval


For data diversity impact on performance (cf. Fig.21), we noticed
that in all cases (100, 1000, 10000 sensors) the query run-time was
similar when considering scalar and multimedia data. This is due
to the fact that we were measuring the time required to retrieve
the data and not the time needed to capture/sense it.

Result Discussion: The performance evaluation showed that the
added concepts/properties do not heavily impact the query run
time, which remains quasi-linear in most cases. This highlights
the feasibility of using HSSN in sensor applications (from a
performance point of view).


5.3.4 Consistency Evaluation. In [15], consistency is defined as a 
criterion that verifies if the ontology allows contradictions. The 
descriptions in the ontology should be consistent. 

Consistency Queries: To evaluate consistency, we adopted the 
following SPARQL queries that search for anti-patterns, a strong 
indicator of inconsistencies, in the ontology. Query 8 detects con- 
cepts with no parent, and query 9 detects abnormally disjointed 
concepts in the ontology: 


Query 8: Searching for concepts with no parent 


SELECT ?a WHERE {?a subClassOf owl:Nothing.} 


Query 9: Searching for abnormally disjointed concepts


SELECT distinct ?A ?B1 ?B2 ?C1 WHERE 
{?B1 subClassOf ?A. ?B2 subClassOf ?A. ?C1 subClassOf ?B1. ?C1 disjointWith ?B2.} 
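
The anti-pattern searched for by Query 9 can be restated as a small Python check over asserted axioms. This is an illustrative sketch: the function name and the abstract class names (A, B1, B2, C1) are assumptions, and, like the query, it only looks at asserted subClassOf and disjointWith pairs without any reasoning.

```python
def abnormal_disjoints(sub_class_of, disjoint_with):
    """Find (A, B1, B2, C1) where B1 and B2 are children of A, C1 is a child
    of B1, and C1 is asserted disjoint with B2 (the pattern of Query 9)."""
    results = set()
    for b1, a in sub_class_of:
        for b2, a2 in sub_class_of:
            if a2 != a:
                continue
            for c1, parent in sub_class_of:
                # Only the asserted direction (C1 disjointWith B2) is checked.
                if parent == b1 and (c1, b2) in disjoint_with:
                    results.add((a, b1, b2, c1))
    return results


# Abstract class names used purely for illustration.
hierarchy = {("B1", "A"), ("B2", "A"), ("C1", "B1")}
clean = abnormal_disjoints(hierarchy, set())
flagged = abnormal_disjoints(hierarchy, {("C1", "B2")})
```

An empty result corresponds to the outcome reported for HSSN, where no such pattern was found.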


Results & Discussion: We found no inconsistencies in the 
HSSN ontology structure. The only concept subsuming nothing 
is owl:Nothing (Query 8). Query 9 results indicate that there are 
no concepts that have abnormal disjoint relations with their rela- 
tives. This denotes the soundness of the integration of newly added 
concepts mainly with the SSN core. Finally, to conclude the consis-
tency evaluation, we ran Protege's HermiT 1.3.8.413 reasoner, and
found no inconsistencies between the asserted class hierarchy and
the inferred one. This highlights the soundness of the graph structure,
which proves critical when considering future alignments between 
HSSN and other ontologies (e.g., that describe smart buildings, 
events). 


6 CONCLUSION & FUTURE WORK 


Many works adopted ontologies for better semantic representation 
of sensor networks. These approaches do not fully consider diver- 
sity in terms of sensors, data, platforms, and application purposes. 
In this paper, we propose an extension of the Semantic Sensor 
Network ontology (SSN), since it is already re-usable in various 
contexts. Our proposed ontology, denoted HSSN, adds to SSN sen- 
sor mobility, and multimedia data related concepts in order to have 
a representation of hybrid sensor networks. HSSN also extends 
the platform representation of SSN in order to fully consider plat-
form diversity. We implemented HSSN and evaluated the accuracy,
clarity, and consistency of our additions, as well as their impact on performance. As
future work, we would like to continue the ongoing evaluation 
of the completeness of the ontology through comparisons with 
mobility and sensor taxonomies. Finally, we want to represent a 
sensor network in a smart environment (e.g., smart building, city) 
for event detection purposes. 
ABSTRACT


Web of Trust offers a way to bind identities with the corresponding 
public keys. It relies on a distributed architecture, where each user 
could play the role of certificate signer. With the widespread diffu- 
sion of social networks, the trust propagation is a matter of growing 
interest. This paper proposes an approach enabling the propagation 
in Web of Trust by means of Ethereum. The usage of Ethereum 
eliminates the necessity of single-organization trusted services, 
which is, in general, not realistic. Although the information stored 
on Ethereum is public, the privacy of users is protected because 
trust chains involve only Ethereum addresses and strong measures 
are implemented to counter their malicious de-anonymization. The
approach relies on the usage of a smart contract for storing the
status of certificate signatures and managing revocations. When a
user u wants to trust another user v, the smart contract checks the 
presence of trust chains originating from root nodes of u. 
1 INTRODUCTION


Social and recommendation systems have experienced exponential
growth in recent years. These sys-
tems offer very attractive means of social interaction and commu-
nication, but they also raise security concerns. Confidentiality, for
example, is weakened by the lack of key management frameworks
that are able to bind social identities with the corresponding public
keys. Consequently, the risk of malicious events is very high. For 
this reason, we believe that more effective solutions and mech- 
anisms are required when users, in open environments, rely on 
public key encryption to obtain security services. For example, a 
user should get answers to the following questions: “is the person I 
am talking to really the one she/he claims to be?”, “who ensures the
trust level of my recipient?”, “is there somebody impersonating the
recipient?”

The design of a central authority, which is trusted by everyone, is 
often not applicable in these contexts. On the contrary, Web of Trust 
[13] ensures a higher level of flexibility since it adopts a distributed 
approach that better suits the nature of the context we are referring 
to. Indeed, Web of Trust offers a way to bind identities with the cor- 
responding public keys in the form of certificates without relying 
on a central authority and exploiting the direct trust between users.
In Web of Trust, users have the capability to sign each other’s cer-
tificates (i.e., the pair identity - public key), and this mechanism
originates a directed trust graph in which arcs represent signatures. 
When a user needs to obtain information about a certificate issued 
by an unknown user, he/she has to check for the presence of one 
or more trusted parties in the list of signatures associated with that 
certificate. 

Although Web of Trust allows in principle trust propagation, its 
direct implementation into the current architecture would require 
either the adoption of certificates with size exponentially growing 
with propagation or trusted servers to which users delegate trust 
chain verification. With no propagation, no trust is required for 
servers. Distributed ledgers offer a solution to avoid trusted cen- 
tral authorities and to guarantee the storage of shared information 
in an immutable and distributed way. In this paper, we propose 
an approach that enables trust propagation in Web of Trust and 
exploits Ethereum to work as a public key infrastructure holding 
the list of signatures and to implement trust management. This
approach matches the current state of Pretty Good Privacy (PGP) 
[23] public key server infrastructure. The result is a system where 
users can sign certificates of other users and can retrieve informa- 
tion about the trust level associated with a certificate by consulting 
the blockchain. Moreover, the proposed solution does not require 
the disclosure of social identities both for the signature and for the 
verification of trust phases, since it is based on users' pseudonyms 
given by the corresponding Ethereum addresses. 

The paper is structured as follows. Section 2 gives some back- 
ground about PGP and Web of Trust, motivating our proposal; 
Section 3 offers details about our solution; Section 4 discusses the 
implementation issues; Section 5 gives some concluding remarks
and offers hints for future work.
2 BACKGROUND AND MOTIVATION
2.1 Web of Trust 


Nowadays, we are spectators of an incredible growth of online 
social networks (OSNs). Unfortunately, at the same time, risks and 
attacks towards these systems are increasing [3, 7]. The risks as-
sociated with OSNs are of different types. In this paper we focus
on trust propagation, which is orthogonal and complementary
to some other problems such as the identification of fake accounts
to counter social attacks, an area in which many works can be found
([4, 7, 21] to cite a few). Instead, in this work, we propose
an approach that enables trust propagation in the Web of Trust 
for secure communications by exploiting Ethereum and the smart 
contract properties. 

One of the best-known methods that provide a means to trust the
association between identities and public keys is the "Web of Trust".
It was first proposed in the context of Pretty Good Privacy
(PGP) in 1991 by Phil Zimmermann [23]. PGP offers authentication
and privacy protection for data communication, while Web of Trust
is a decentralized method for certifying a given PGP
public-key certificate. Indeed, people can sign each other's public
keys so that progressively, and dynamically, they create a network
of interconnected links and signatures [1]. When someone needs
to trust the public key of an unknown user, she/he has to verify the
set of signatures associated with it and the identities of the signers.
The result is that each person will have a different and subjective
view of another person based on the signatures of her/his certificate. This
is due to the figure of “introducers”, who are the trusted people
whose signature represents a trustworthiness guarantee for a PGP
certificate. The set of introducers is chosen by each person, thus
influencing the personal perception of the network identities.

In this context, it is fundamental to compute and manage, in a 
timely manner, the trustworthiness of both PGP public-key 
certificates and introducers. According to [1], the former can have 
three levels of trustworthiness (i.e., undefined, marginal and 
complete), while the latter can be fully trusted, marginally 
trusted, untrustworthy or unknown (“don't know”). Moreover, since 
the Web of Trust is generally subjective, every user can resolve 
her/his scepticism by suitably tuning two thresholds, which set 
the minimum number of introducers' signatures that must be present 
on a certificate for the user to consider it complete. 


Figure 1: Signatures in Web of Trust 


To explain the basic idea behind the Web of Trust, let us consider 
the scenario depicted in Figure 1. A total of 6 users are 
interconnected by a set of edges representing signatures of public 
keys. The origin of an edge represents the signer, while the 
destination is the signed party. The edge is bidirectional when a 
mutual signature is present. For example, Charlie signs the 
certificates of Alice, David, Erin and Justin, while Bob is signed 
by David and Erin. The reader can deduce the other signatures by 
following the edges. 

In this scenario, each user can ask for the verification of another 
user's public key after defining her/his own set of introducers. 
For the sake of simplicity, we do not consider the levels of 
trustworthiness. This means that, for example, Alice can trust 
Bob's public key if David and/or Erin belong to the set of Alice's 
introducers and she has settled for 1 signature. If she needs at 
least 2 signatures, both David and Erin need to belong to her set 
of introducers; if she needs more than 2 signatures to trust a 
user's public key, she will never trust Bob's public key. 
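
As an illustration only, the threshold rule of this example can be sketched in a few lines of Python. The edge set contains just the signatures that the text states explicitly for Figure 1; the name trusts_key and its parameters are hypothetical, and the levels of trustworthiness are ignored, as in the example.

```python
# Signatures stated in the text for Figure 1, as (signer, signed) pairs.
# Edges of the figure that the text does not list are omitted here.
SIGNATURES = {
    ("Charlie", "Alice"), ("Charlie", "David"), ("Charlie", "Erin"),
    ("Charlie", "Justin"), ("David", "Bob"), ("Erin", "Bob"),
}


def trusts_key(introducers, target, needed, signatures=SIGNATURES):
    """True if at least `needed` of the verifier's introducers signed `target`."""
    signers = {signer for (signer, signed) in signatures if signed == target}
    return len(signers & set(introducers)) >= needed


# Alice verifying Bob's key under different choices of introducers.
print(trusts_key({"David"}, "Bob", needed=1))           # True
print(trusts_key({"David"}, "Bob", needed=2))           # False
print(trusts_key({"David", "Erin"}, "Bob", needed=2))   # True
print(trusts_key({"David", "Erin", "Charlie"}, "Bob", needed=3))  # False
```

Only two of the listed signatures point to Bob, so any policy asking for more than 2 signatures can never be met, whatever the set of introducers.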

Furthermore, what happens if Alice wants to verify Bob's public 
key and her only introducer is Charlie? In our previous example, 
there is a path of signatures (of 2 steps) from Charlie to Bob. PGP 
is equipped with a parameter, named CERT_DEPTH, to establish 
the maximum length of the certification chain; unfortunately, this 
value is often not used in real applications since it can be quite 
difficult to use appropriately [1, 16]. This has reduced the 
interest of researchers in working on new solutions for the 
propagation of trust. Nevertheless, since we are talking about a 
Web of Trust, we think it is necessary to enable propagation to 
make the most of PGP, especially in OSN applications, where the 
number of users is very high. In this sense, we can start from the 
assumption of “objectiveness” of trust: if Alice fully trusts 
Charlie as her introducer, she is trusting, at the same time, his 
actions and his decisions. As a consequence, if Charlie considers 
David trustworthy, then Alice will consider David trusted as well. 
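
The effect of a bound on the certification chain can be illustrated with a small breadth-first search over the signatures stated for Figure 1. This is only a sketch: chain_length and max_depth are illustrative names that mimic the role of CERT_DEPTH, not PGP code.

```python
from collections import deque

# Signatures stated in the text for Figure 1, as (signer, signed) pairs.
SIGNATURES = {
    ("Charlie", "Alice"), ("Charlie", "David"), ("Charlie", "Erin"),
    ("Charlie", "Justin"), ("David", "Bob"), ("Erin", "Bob"),
}


def chain_length(introducer, target, max_depth, signatures=SIGNATURES):
    """Length of the shortest signature chain from introducer to target,
    or None when no chain of at most max_depth steps exists."""
    frontier = deque([(introducer, 0)])
    seen = {introducer}
    while frontier:
        node, depth = frontier.popleft()
        if node == target:
            return depth
        if depth == max_depth:
            continue  # do not extend chains beyond the allowed depth
        for signer, signed in signatures:
            if signer == node and signed not in seen:
                seen.add(signed)
                frontier.append((signed, depth + 1))
    return None


print(chain_length("Charlie", "Bob", max_depth=2))  # 2 (via David or Erin)
print(chain_length("Charlie", "Bob", max_depth=1))  # None
```

With Charlie as her only introducer, Alice reaches Bob's key only if chains of 2 steps are allowed.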

In a standard PGP scenario, enabling propagation requires keeping 
an updated and trusted certification chain in each PGP member's 
certificate and, moreover, storing full trust chains on PGP 
servers. This is clearly in contrast with the Web of Trust 
principles. For this reason, the rest of the paper describes 
our solution, which exploits Ethereum instead of PGP servers. 

To the best of our knowledge, few works have proposed ways to 
enable propagation in the Web of Trust or used a blockchain-based 
solution to implement PGP and trust evaluation mechanisms. In [15], 
for example, the authors proposed a way to further expand the 
trusted neighbourhood. The basic idea relies on the possibility of 
signing a user's certificate with one of the two values {+1, −1}, 
where −1 represents a signer who believes the certificate is not 
authentic. Starting from these values, the authors provide the 
formulas needed to evaluate a user's feedback. Other existing works 
propose the usage of blockchain to secure Trust Management systems 
for authentication. For example, in [2] the authors formally model 
such systems as trust graphs and explore how the usage of a 
blockchain can mitigate attacks. In [18] the authors proposed a 
formula to calculate the trust degree between two users in an 
e-commerce setting as a combination of direct and indirect trust 
degrees. Starting from this formula, the authors in [6] exploit 
blockchain as a means to store the necessary information. The work 
in [22] proposes a framework supporting fast propagation of 
certificate revocation and elimination of man-in-the-middle risk by 
using blockchain. With respect to all these works, our proposal 
exploits blockchain to store certificate signatures and a smart 
contract to verify the presence of trust chains, enabling 
propagation in the Web of Trust. The proposed solution also 
counters the malicious de-anonymization of social identities, since 
the trust propagation stores only information about Ethereum 
addresses. Moreover, the introduction of a smart contract does 
not require clients to be full nodes (i.e., nodes that store the 
full blockchain and participate in block validation) or to perform 
expensive computations to navigate the trust graph. The smart 
contract also introduces economic disincentives to malicious 
behaviours (such as Denial of Service and Sybil attacks). In 
Section 3, we describe the details of our Ethereum-based proposal 
for enabling propagation in the Web of Trust. 


2.2 Ethereum 


Since we need a trusted and decentralised mechanism to overcome 
these issues, we decided to implement a blockchain-based solution 
enabling propagation in the Web of Trust. In particular, we use 
a blockchain-based platform called Ethereum [5, 9, 20], which 
allows the development of DApps (Decentralised Applications) that 
need to interact with each other in a secure and fast way. Ethereum 
uses the distributed ledger model to realise a virtual computer: 
it provides a decentralised virtual machine (Ethereum Virtual 
Machine, EVM) capable of executing code (the so-called Smart 
Contracts). 

In Ethereum, we can distinguish between two kinds of accounts: 
(1) Externally Owned Accounts (EOAs); (2) Contract Accounts (i.e., 
Smart Contracts). The former are controlled by private keys, while 
the latter are controlled by the code of the contract itself. At the 
moment, the most widely used high-level, Turing-complete 
programming language that compiles to EVM bytecode is Solidity [8]. 
This programming language is used to write and develop Smart 
Contracts. 

As we just said, Ethereum Smart Contracts are real nodes of 
the network, like EOAs, and are used as agreements between users 
who do not trust each other. Indeed, they exploit the decentralised 
and distributed consensus mechanism of the blockchain, which does 
not require any Trusted Third Party (TTP). 

In more detail, this mechanism is established by mining based on 
the proof-of-work (PoW) scheme. In PoW, the winning miner is the 
one who first solves a mathematical puzzle. The average time for 
mining a block of transactions is about 10-12 seconds. As a 
consequence, new branches of the main chain are generated very 
often, and it is necessary to manage these forks to guarantee a 
certain level of security and decentralisation of the mining 
process [10]. For this purpose, Ethereum implements a simplified 
and modified version of the GHOST (Greedy Heaviest Observed 
Subtree) protocol, in which “uncle” blocks are also partially 
considered when computing which block has the largest and heaviest 
total proof-of-work backing it. 

Another feature of smart contracts is that they can easily access 
data that is stored on Ethereum. Moreover, they can also process, 
edit or write new data. Finally, it is important to underline that 
the code of smart contracts can be executed by every node of the 
blockchain. 

One of the most relevant and successful features of Ethereum 
concerns tokens. In this environment, a token is a particular 
cryptocurrency that has no value until someone or something (e.g. 
the crypto-market) assigns one to it. Usually, a company or a 
single person that decides to implement a new token via a smart 
contract publishes the business idea in a white paper and offers 
the token during an Initial Coin Offering (ICO) period [11]. 


3 DESCRIPTION OF OUR PROPOSAL 


The core of our proposal resides in the representation of a 
Web-of-Trust-based trust model, allowing the propagation of trust 
among a domain of public keys associated with real-life identities. 
Even if we assume that real-life identities operate in an OSN 
context, the proposed solution is general enough to be applied in 
other contexts. Users need to publish their public keys (e.g., on 
their own social network profiles). We remark that this proposal 
does not address the problem of impersonation and fake profiles in 
social networks, for which a wide literature exists; the existing 
approaches and techniques can be applied orthogonally together 
with our solution. 

According to the Web of Trust model, every user u elects a 
number of introducers, who are persons that u trusts objectively. 
From the point of view of trust propagation, public keys (actually, 
certificates) associated with the introducers play the role of root 
certificates whenever u wants to verify trust paths. More formally, 
we define a set of users U and a function f_i : U → 2^U which, for 
any user u ∈ U, returns the set f_i(u) ∈ 2^U of the introducers of 
u. Given a user u, we denote by C_u the certificate including the 
public key associated with the social-network identity of u. 

As highlighted in Section 2, in order to avoid the necessity of 
single-organization trusted services implementing trust propagation 
and certificate revocation, we leverage Ethereum. Thus, we 
refer to another domain of identities, which is composed of the set 
of Ethereum addresses. We require that any user u, to participate 
in the Web of Trust, is able to associate her/his certificate C_u 
with an Ethereum address ETH_u. The idea is that the trust graph is 
built via an Ethereum smart contract SC, is stored in the state 
of SC, and is managed through the functions of the same smart 
contract. Let us denote by ETH_SC the Ethereum address of SC. The 
smart contract SC includes the following functions: sign, verify, 
and revoke. The exact definition of these functions will be 
explained throughout this section. The status of SC is composed of 
a directed graph G_S of Ethereum addresses, a list L_r of revoked 
Ethereum addresses, and a data structure storing the number of 
failed attempts of Ethereum addresses, if not null (this point 
will be explained in the Trust Verification process below). 

We represent Ethereum transactions as tuples (src_address, 
recipient_address, data), where src_address denotes the Ethereum 
address of the sender, recipient_address denotes the Ethereum 
address of the receiver, and data is the field including additional 
information (allowed in Ethereum). In our representation, we do 
not make explicit the fact that any transaction is signed by the 
sender by means of the Ethereum private key, and we omit some 
information related to specific features of Ethereum (e.g., gas 
price). We highlight that we do not define a generic representation 
of smart-contract events because the structure and content of any 
event can be defined by the smart-contract designer. In the 
following, we list the content of events as tuples. 
Now, we describe how we map, in our model, the basic functions 
of the Web of Trust, which are certificate signature, trust 
verification, and certificate revocation. We want to highlight that 
each of the following operations calling the smart contract can be 
performed only by Ethereum addresses that have not been revoked, 
since the smart contract filters out all illegitimate requests 
it receives. 

For the sake of clarity, when we say that a user u trusts another 
user v, we mean that u obtains from our trust infrastructure the 
information that the user v (actually, her/his certificate C_v) is 
reliable. According to the Web of Trust, we consider three 
different levels of trust, namely COMPLETE, MARGINAL and UNDEFINED, 
in order of decreasing reliability. To be realistic, we assume 
that users trust only reciprocally, because we consider that trust 
is required as a preliminary step of an interaction between two 
users who do not know each other. So, when a user u signs a 
certificate C_v, we say that u gives trust to v. 


Certificate Signature. This phase is carried out when a user 
u wants to sign the certificate C_v of another user v. She/he has 
to generate an Ethereum transaction T_u = (ETH_u, ETH_SC, ETH_v) 
calling the function sign of the smart contract SC, whose effect is 
to update the status of SC by inserting the arc (ETH_u, ETH_v) in 
G_S, provided that ETH_v is not in L_r. Observe that, as said 
before, this operation can be carried out only by Ethereum 
addresses that are not in the revocation list L_r. 
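
The effect of sign on the status of SC can be mimicked off-chain with a toy Python model. It is a sketch under stated assumptions, not the contract code: the class and attribute names are hypothetical, addresses are plain strings, and the request is rejected when either the sender or the signed address is revoked.

```python
class SignModel:
    """Toy off-chain model of the part of SC's status touched by sign."""

    def __init__(self):
        self.graph = set()    # G_S: arcs (signer address, signed address)
        self.revoked = set()  # L_r: revoked Ethereum addresses

    def sign(self, sender, target):
        """Mimic the effect of the transaction (sender, SC, target)."""
        if sender in self.revoked or target in self.revoked:
            return False      # request filtered out, status unchanged
        self.graph.add((sender, target))
        return True


sc = SignModel()
print(sc.sign("0xA", "0xB"))   # True: the arc (0xA, 0xB) is stored
sc.revoked.add("0xC")
print(sc.sign("0xC", "0xB"))   # False: the sender is revoked
print(sc.sign("0xA", "0xC"))   # False: the signed address is revoked
```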


Trust Verification. The goal of this process is that the users u 
and v learn each other's Ethereum addresses and, at the same time, 
trust each other. Observe that the second result is 
reached only if, once the Ethereum addresses have been disclosed, 
the smart contract verifies that the trust paths starting from those 
addresses satisfy the policies required by the users. 

To discourage malicious attempts by users who only aim to discover 
the association between a social profile and the corresponding 
Ethereum address, this procedure can be started only by users who 
demonstrate that they own a sufficient amount of trust k (in terms 
of the number of signatures on their certificate). This measure 
recalls, in some sense, the Proof of Stake protocol [19]. In our 
case, malicious behavior is discouraged because the user risks 
her/his trust. Indeed, the smart contract will revoke Ethereum 
addresses after a certain number n of failures. For us, a user u 
fails when she/he does not satisfy the policy P_v of the user v. 

So, in this trust verification process, u and v send the following 
transactions to SC. u sends the transaction T_u = (ETH_u, ETH_SC, 
Data_u), where Data_u is the tuple (f_i(u), R_u, R, P_u), in which 
R_u denotes the result of encrypting, with the public key included 
in C_v, and signing, with the private key of u (thus associated 
with the public key included in C_u), a random number R exchanged 
previously by social-network (out-of-band) interaction between u 
and v; R is the random number in clear text (required to link this 
transaction with the other one generated by v); and P_u is the 
policy required by u for v to be considered trusted. 

In turn, v sends the transaction T_v = (ETH_v, ETH_SC, Data_v), 
where Data_v is the tuple (f_i(v), R_v, R, P_v), in which R_v 
denotes the result of encrypting, with the public key included in 
C_u, and signing, with the private key of v (thus associated with 
the public key included in C_v), the same random number R 
exchanged previously by social-network (out-of-band) interaction 
between u and v; R, again, is the random number in clear text 
(required to link this transaction with the other one generated by 
u); and P_v is the policy required by v for u to be considered 
trusted. 

In more detail, R_u and R_v represent the challenges aimed at 
proving that the Ethereum addresses ETH_u and ETH_v are owned by 
u and v, respectively. The signature guarantees the (source) 
authentication of the challenge, while the encryption of R 
guarantees the confidentiality of the link between Ethereum address 
and social network profile. 

The effect of the above transactions (for any u) is to call the 
function verify of SC, which works as follows: 


- first, it checks that the sender u is not in L_r; 

- if u is not revoked, then it verifies that C_u has been signed 
a number of times greater than the threshold k; 

- if so, the function stores the mapping between ETH_u and R; 

- for every couple (ETH_u, ETH_v) of Ethereum addresses mapped 
with R, the function checks for trust paths, starting from the 
Ethereum address of the other user v, and computes and returns 
the trust level based on f_i(u) and on the policy required 
by u (see Section 4 for further details), emitting an event 
only if such policy is satisfied; 

- if, instead, the policy of u has not been satisfied, then the 
smart contract updates its status by increasing the number of 
failures of v. 
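
The control flow of verify can be mimicked off-chain with the following Python sketch. It is not the contract code and simplifies several points, all of which are assumptions: a policy is reduced to a number of direct signatures by non-revoked introducers (no trust paths or levels), the caller must own at least k signatures, the ciphered challenge is left out, and all names are hypothetical.

```python
class VerifyModel:
    """Toy off-chain model of the verify function of SC."""

    def __init__(self, arcs, k, n):
        self.arcs = set(arcs)   # G_S: (signer address, signed address)
        self.k = k              # signatures a caller must own (assumed: at least k)
        self.n = n              # failures tolerated before automatic revocation
        self.revoked = set()    # L_r
        self.failures = {}      # address -> number of failed attempts
        self.pending = {}       # R -> list of (address, introducers, needed)
        self.events = []        # emitted (ETH_u, ETH_v, R) tuples

    def _signers(self, addr):
        return {s for (s, t) in self.arcs if t == addr}

    def _policy_ok(self, introducers, needed, other):
        # Simplification: only direct signatures by valid introducers count.
        valid = set(introducers) - self.revoked
        return len(self._signers(other) & valid) >= needed

    def _fail(self, addr):
        self.failures[addr] = self.failures.get(addr, 0) + 1
        if self.failures[addr] > self.n:
            self.revoked.add(addr)   # automatic revocation

    def verify(self, sender, introducers, random, needed):
        if sender in self.revoked or len(self._signers(sender)) < self.k:
            return False             # request filtered out
        for other, o_intro, o_needed in self.pending.get(random, []):
            checks = ((sender, introducers, needed, other),
                      (other, o_intro, o_needed, sender))
            for who, intro, need, target in checks:
                if self._policy_ok(intro, need, target):
                    self.events.append((who, target, random))
                else:
                    self._fail(target)
        self.pending.setdefault(random, []).append((sender, introducers, needed))
        return True


arcs = {("I1", "U"), ("I2", "U"), ("I1", "V"), ("I2", "V"), ("X", "Z")}
sc = VerifyModel(arcs, k=1, n=1)
sc.verify("U", {"I1"}, random=7, needed=1)
sc.verify("V", {"I2"}, random=7, needed=1)   # U and V now trust each other
sc.verify("Z", set(), random=7, needed=0)    # Z reuses the clear-text R
print("Z" in sc.revoked)                     # True: Z failed both policies
```

In this run the intruder Z obtains events for its own trivially satisfied policy, but it fails the policies of both U and V, exceeds n = 1 failures and is revoked.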


In particular, events have the following structure: (ETH_u, ETH_v, 
R_v, R, T_uv), where ETH_u and ETH_v represent the couple of users 
u and v linked by R, and T_uv is the trust, computed by the smart 
contract, that u has with respect to v. 

Now, each user who called the function listens on the blockchain 
for events containing her/his R. When she/he finds one (or more), u 
decrypts R_v so that she/he can have confirmation that v is 
actually the one with whom she/he is interacting on the social 
network. 

In our architecture, a malicious user z may try to carry out a 
man-in-the-middle attack, since the random number R is sent in 
clear text. To mitigate this risk, we add some disincentives 
to our model. In fact, while it is true that z can generate a 
transaction with R so that the smart contract finds the couples of 
users (u, z) and (v, z) as well as the legitimate couple (u, v), it 
is also true that: 


- the attacker z must have a corresponding certificate C_z 
signed at least k times; 

- if the attacker z does not satisfy the policy of the victim, the 
smart contract will update its status by increasing the number of 
failed attempts of z and, when this counter becomes greater than 
n, her/his certificate C_z will be automatically revoked by adding 
ETH_z to L_r; 

- in Ethereum, every operation costs some amount of gas, so 
z spends gas each time she/he tries to carry out an attack. 


The result of these countermeasures and precautions is that 
the attacker z loses all her/his trust and her/his Ethereum 
address is blacklisted and filtered out. 
Certificate Revocation. When a user u wants to revoke her/his 
certificate, she/he has to generate a transaction to SC from her/his 
Ethereum address ETH_u, calling the function revoke, which 
changes the status of the smart contract by adding ETH_u to the 
list of revoked Ethereum addresses L_r. From this moment on, every 
trust path that passes through ETH_u is considered invalid 
and, moreover, if another user w has ETH_u in her/his f_i(w), 
the smart contract filters this list of introducers by removing 
ETH_u (and possibly other revoked addresses). 

Furthermore, as we said before, there is the case in which Certifi- 
cate Revocation is carried out automatically by the smart contract 
to prevent DoS, replay attacks, and so on. In particular, this Cer- 
tificate Revocation happens when a user fails the Trust Verification 
phase more than n times. 
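
The two consequences of a revocation, filtered introducer lists and invalidated trust paths, can be sketched in Python as follows. The helper names are hypothetical, and the bound of 6 hops is borrowed from Section 4 as an assumption.

```python
from collections import deque


def filter_introducers(introducers, revoked):
    """Drop revoked addresses from a user's introducer list f_i(w)."""
    return [a for a in introducers if a not in revoked]


def path_exists(arcs, start, target, revoked, max_hops=6):
    """Breadth-first search for a trust path that avoids revoked addresses."""
    if start in revoked or target in revoked:
        return False
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        node, hops = frontier.popleft()
        if node == target:
            return True
        if hops == max_hops:
            continue
        for signer, signed in arcs:
            if signer == node and signed not in seen and signed not in revoked:
                seen.add(signed)
                frontier.append((signed, hops + 1))
    return False


arcs = {("0xA", "0xB"), ("0xB", "0xC")}
revoked = set()
print(path_exists(arcs, "0xA", "0xC", revoked))     # True: 0xA -> 0xB -> 0xC
revoked.add("0xB")                                   # effect of revoke called by 0xB
print(path_exists(arcs, "0xA", "0xC", revoked))     # False: the only path crosses 0xB
print(filter_introducers(["0xA", "0xB"], revoked))  # ['0xA']
```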


4 IMPLEMENTATION ISSUES 


Having described our proposal, we now move on to some 
implementation details. First, in Section 3, we introduced the 
policy P_u of the user u, which must be satisfied for the function 
verify of the smart contract SC to emit the event. In particular, 
by P_u we mean a function based on the two well-known parameters 
of the Web of Trust, COMPLETES_NEEDED and MARGINALS_NEEDED [1]. 
These two parameters work as thresholds, in the sense that they 
define the number of fully trusted introducers or marginally 
trusted introducers needed to reach the desired trustworthiness of 
the certificate. In more detail, Figure 2 depicts the declaration 
of the function verify, whose parameters correspond to the Data_u 
field described in Section 3. 


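pragma solidity 0.5.7; 
contract SC { 
    // "memory" and "public" are not in the original figure; Solidity 0.5 
    // requires an explicit data location and visibility (assumed here). 
    function verify(address _to, address[] memory introducers, 
        bytes32 ciphered_random, uint256 random, 
        uint256 completes_needed, uint256 marginals_needed) public { 
        // body not shown in the original figure 
    } 
} 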


Figure 2: Declaration of the function verify 


Furthermore, it is important to give more details about how we 
actually propagate trust. As said before, we want to remark that 
trust should be considered more objective than it usually is, 
because if I give trust to another person, then I am giving trust 
to her/his decisions too. 

However, it is also realistic to think that, during propagation, 
the trust value should decrease after a certain number of hops. 
Moreover, even if we implement a smart way to store and manage 
the trust graph in the smart contract, every operation carried 
out by it is costly. We therefore apply the theory of the small 
world and six degrees of separation [14, 17], together with the 
so-called horizon of observability, a value deriving from network 
theory and related to the FOAF (Friend of a Friend) concept, which 
oscillates between two and three [12], in the following way: 
- if the hop counter needed to reach my introducers is less 
than the horizon of observability (that is equal to 3), then 
the trust level propagates without any decrease; 

- instead, if the hop counter needed to reach my introducer is 
greater than 3, the trust value decreases (from full to marginal 
and from marginal to don't know); 

- in particular, when the hop counter reaches a value equal to 
six (like the degrees of separation), the algorithm stops in 
order to avoid spending too much gas. 
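
These three rules can be summarised by a small Python function. The sketch fills two points that the rules leave open, and both are assumptions: a hop count of exactly 3 is treated as still inside the horizon, and beyond the horizon the level drops by a single step regardless of how many extra hops are needed. The names are hypothetical.

```python
LEVELS = ["full", "marginal", "don't know"]  # in order of decreasing trust
HORIZON = 3    # horizon of observability
MAX_HOPS = 6   # six degrees of separation: the search stops here


def propagated_trust(level, hops):
    """Trust level that reaches the verifier after `hops` hops,
    or None when the search is abandoned to save gas."""
    if hops >= MAX_HOPS:
        return None
    if hops <= HORIZON:   # assumption: exactly 3 hops does not decrease trust
        return level
    weaker = min(LEVELS.index(level) + 1, len(LEVELS) - 1)
    return LEVELS[weaker]


print(propagated_trust("full", 2))      # full
print(propagated_trust("full", 4))      # marginal
print(propagated_trust("marginal", 5))  # don't know
print(propagated_trust("full", 6))      # None
```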


pragma solidity 0.5.7; 
contract SC { 
    event trust_satisfied(address indexed _from, address indexed _to, 
        bytes32 ciphered_random, uint256 indexed random, 
        string t_value); 
} 


Figure 3: Declaration of the event trust_satisfied 


Having explained how we propagate trust, we now move on to the 
emission of the event. As shown in Figure 3, we called the event 
trust_satisfied, and it has all the parameters necessary to 
communicate that the trust verification phase has been 
successful. Observe that the keyword indexed allows queries to 
filter the events logged by the smart contract on the parameters 
preceded by this keyword. 


Figure 4: Store schema 


To store the information about the trust graph, the smart contract 
has a high storage capacity available internally, but it has to 
organize data in a static array. Figure 4 depicts the data 
structure used by the smart contract for the trust graph. A static 
array of length n is used (left of the figure). To each element of 
this array we associate a list of blocks, each one containing a 
couple of Ethereum addresses. When the smart contract has to store 
the information about a trust from the Ethereum address ETH_i to 
ETH_j, it enters the list of blocks addressed by the mod (modulo) 
operation on the address ETH_i and appends the new block made of 
the couple <ETH_i, ETH_j> to the list. Intuitively, a high value 
for the dimension n reduces the number of collisions among Ethereum 
addresses, at the cost of more data space. This data structure 
also supports an efficient search for the addresses signing an 
Ethereum address ETH_j: for a candidate signer ETH_i, the smart 
contract has to scan the list associated with the i mod n location 
and search for ETH_j in the second element of the blocks. 
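
A minimal Python sketch of such a bucketed store is given below. It assumes that the bucket is selected by the signer's address, read as a hexadecimal integer and reduced mod n; the class and method names are hypothetical, and Python lists stand in for the contract's static array and block lists.

```python
class TrustStore:
    """Static array of n slots, each holding a list of (signer, signed) blocks."""

    def __init__(self, n):
        self.n = n
        self.buckets = [[] for _ in range(n)]

    def _slot(self, address):
        # Assumption: the address is interpreted as an integer, then mod n.
        return int(address, 16) % self.n

    def add(self, signer, signed):
        """Store the trust from signer to signed in the signer's bucket."""
        self.buckets[self._slot(signer)].append((signer, signed))

    def has_signed(self, signer, signed):
        """Scan the signer's bucket, looking for signed in the second element."""
        return any(s == signer and t == signed
                   for (s, t) in self.buckets[self._slot(signer)])


store = TrustStore(n=4)
store.add("0x01", "0x0a")
store.add("0x05", "0x0b")                # 0x01 and 0x05 collide: both map to slot 1
print(store.has_signed("0x01", "0x0a"))  # True
print(store.has_signed("0x05", "0x0a"))  # False
print(len(store.buckets[1]))             # 2
```

Because different addresses can share a slot, the scan compares the first element of each block as well; a larger n shortens the lists at the cost of a longer array.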

The same data structure is also used inside the smart contract to 
store both the revocation list L_r and the number of failed 
attempts of Ethereum addresses. In both cases, the key to access 
the array is the Ethereum address to store, mod n, while the blocks 
are structured as a couple made of an Ethereum address and a 
boolean flag for the revocation list, and of an Ethereum address 
and a counter for the number of failed attempts. 


5 CONCLUSIONS AND FUTURE WORK 


In this paper, we propose a solution, based on Ethereum and smart 
contracts, to enable propagation in the Web of Trust. In 
particular, our solution exploits Ethereum to avoid the necessity 
of single-organization trusted services. Moreover, the privacy of 
users is protected, since the proposal uses Ethereum addresses to 
store trust relationships. Strong measures are also proposed to 
counter malicious de-anonymization of the address-identity 
couples. A smart contract stores the current status of certificate 
signatures and manages revocations. The usage of a smart contract 
simplifies the client operations, since clients do not have to 
perform expensive computations to navigate the trust graph. The 
smart contract also represents an economic disincentive to 
malicious behaviours. 

As future work, we plan to apply our solution in different 
domains in which trust propagation is required. Furthermore, we 
will apply formal verification techniques to formally prove the 
security level of our solution. Finally, we will try to adopt more 
effective challenges to prove the ownership of Ethereum addresses 
without the possibility of disclosing the social identity. 
ABSTRACT 


The Medical Informatics Platform (MIP) of the Human Brain Project 
(HBP) is tasked with providing its users with diverse, high-quality 
clinical data and tools for medical analysis, while complying with 
national legislation on privacy and security. Data, which is 
provided by a large number of hospitals, tends to be heterogeneous 
and also has a constantly changing schema, due to hospitals’ need 
to capture more information. In this paper we look into the MIP’s 
data ingestion pipeline and focus on the steps taken by our 
team to properly integrate clinical data from heterogeneous sources 
while ensuring its quality throughout the processing pipeline. We 
have developed tools both for meta-data management and quality 
control. 


CCS CONCEPTS 


- Information systems → Version management; Information 
systems applications; Data management systems; Information 
integration. 


KEYWORDS 


Database Management, Meta-Data Management, Quality Control, 
Schema Matching, Data Integration, Clinical Data, MIP, Medical 
Informatics Platform 


ACM Reference Format: 

Admir Demiraj, Kostis Karozos, Iosif Spartalis, and Vasilis Vassalos. 2019. 
Meta-Data Management and Quality Control for the Medical Informatics 
Platform. In 23rd International Database Engineering & Applications Sym- 
posium (IDEAS’19), June 10-12, 2019, Athens, Greece. ACM, New York, NY, 
USA, 9 pages. https://doi.org/10.1145/3331076.3331088 


Permission to make digital or hard copies of all or part of this work for personal or 
classroom use is granted without fee provided that copies are not made or distributed 
for profit or commercial advantage and that copies bear this notice and the full citation 
on the first page. Copyrights for components of this work owned by others than the 
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or 
republish, to post on servers or to redistribute to lists, requires prior specific permission 
and/or a fee. Request permissions from permissions@acm.org. 

IDEAS’19, June 10-12, 2019, Athens, Greece 

© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM. 
ACM ISBN 978-1-4503-6249-8/19/06...$15.00 
https://doi.org/10.1145/3331076.3331088 


Kostis Karozos 
karozos@aueb.gr 
Department Of Informatics 
Athens University of Economics and Business 
Athens, Greece 


Vasilis Vassalos 
vassalos@aueb.gr 
Department Of Informatics 
Athens University of Economics and Business 
Athens, Greece 


1 INTRODUCTION 


It is commonly agreed that understanding the human brain is one 
of the most compelling and difficult tasks that humanity has 
undertaken. The possible implications of such an endeavour are 
countless and would have a huge impact on many aspects of our 
society. The HBP [2] is a large 10-year scientific research project 
that aims to create an infrastructure of cutting-edge technology 
in order to allow researchers to advance the fields of brain-related 
medicine, computing and neuroscience. In order to achieve such 
an ambitious goal, six Information and Communication Technology 
(ICT) platforms have been created: the Neuroinformatics Platform¹, 
the Brain Simulation Platform², the High-performance Analytics 
and Computing Platform³, the Medical Informatics Platform⁴ [2], 
the Neuromorphic Computing Platform⁵ [5] and the Neurorobotics 
Platform⁶ [7]. 

In this paper we focus on the Medical Informatics Platform 
(MIP), which aims at understanding disease clusters and their 
respective disease signatures through local and federated analysis 
of data residing in a wide network of hospitals. The federated 
analysis is a crucial component of the project, as it combines 
analytical results from data stored in various sources while 
guaranteeing that no sensitive information is leaked out of the 
facilities of the hospitals. By overcoming the issues of privacy 
preservation that are imposed by national legislation and 
institutional ethics, the project intends to encourage the 
scientific community to experiment with records that until 
recently had been totally inaccessible. 

In the medical field, providing high quality information is of utmost 
importance in order to have a valid analysis. This task becomes 
especially difficult when considering that data resides in multiple 
sources and is heterogeneous, containing both electronic health 
records (EHR) and imaging features. In order to make the analysis 
feasible across all hospitals, data that is provided has to be matched 


¹https://www.humanbrainproject.eu/en/explore-the-brain/neuroinformatics-platform/ 
²https://www.humanbrainproject.eu/en/brain-simulation/brain-simulation-platform/ 
³https://www.humanbrainproject.eu/en/hbp-platforms/hpac-platform/ 
⁴https://www.humanbrainproject.eu/en/medicine/medical-informatics-platform/ 
⁵https://www.humanbrainproject.eu/en/silicon-brains/neuromorphic-computing-platform/ 
⁶https://www.humanbrainproject.eu/en/robots/ 
to a unified global schema. We will be referring to the elements of 
the global schema as Common Data Elements (CDEs⁷). The CDEs, 
which describe the knowledge about patients’ brain structure, 
diagnosis, clinical test results as well as genetic and demographic 
information in a clear and straightforward way, are a product of 
the scientific work of HBP’s clinicians and researchers. The 
process of aligning the local variables to the CDEs is usually 
referred to as a mapping task and is done with the use of MIPMap 
[11] [21], a schema mapping and data exchange tool. In many cases 
the schema of the hospital variables changes due to errors or empty 
values, or simply because we want to incorporate more information. 
Moreover, the global schema is also updated, because as more 
hospitals provide their data we find that we need to extend it 
with elements that were not encountered before and are present 
in many new hospitals. This means that we need to 
map the changing schemas of variables from different hospitals 
to a changing global schema, and each change has to be 
communicated to a large team. This complication soon led to the 
need to create a single point of reference for maintaining and 
implementing changes to the schema of both the variables and the 
CDEs. Another issue we encounter is that in many cases data 
provided by the hospitals is of poor quality, and we have to decide 
whether to correct it or drop it entirely. This means that we need 
a way to identify which records might be faulty and act accordingly. 
In this paper we present both a tool for managing the schemas of 
the data and a tool that provides a quality analysis of the 
actual data. We will be referring to the first tool interchangeably 
as Meta-Data Management Tool (MDMT) or Data Catalogue (DC) 
and to the second tool as Quality Control Tool (QCT). 


2 RELATED WORK 
2.1 Meta-Data Management 


Sen and Arun [18] describe the growing implications of meta-data
in multiple fields. Lauren E. Sweet and Heather Lea Moulaison [20]
note the importance of meta-data interoperability for a wider medical
analysis and argue that there are a number of issues in applying a
common standard due to the complexity of the data. Greenberg et al.
[8] note that data is created faster than meta-data can organize it.
Curdt et al. [6] describe their experience in creating a meta-data
management tool and argue for the importance of versioning. Jiang et
al. [9] provide an overview of an end-to-end meta-data management
solution.


2.2 Quality Control 


Adlassnig et al. [1] and Shabestarial et al. [19] provide an overview
of the requirements and challenges of quality assurance in EHR.
Perimal-Lewis et al. [16] try to identify the key factors that
influence and determine the quality of EHR. Kerr et al. [12] show the
best practices and potential benefits in the management of quality
in EHR. Boyle et al. [4] created a system to evaluate the quality of
EHR by identifying the latest valid results and the data sources that
they were extracted from. Orfanidis et al. [15] focus on the quality
issues that are encountered in the adaptation of EHR for managing
the data in medical facilities. They also propose a system that will
implement sequential steps in order to ensure quality. Arts et al.
[3] have identified through research several quality procedures for
medical registries and created a framework for their implementation.

⁷ The current schema of the CDEs contains 172 elements and is published in:
https://github.com/HBPMedical/mip-cde-meta-db-setup/blob/master/variables.json

Prieto et al. [17] utilize the DICOM header of radiology images to
identify slices that were retaken, thereby eliminating duplicate
information. Kallman et al. [10] identify changes in the patients'
exposure and in the quality of the image by using an automated
method to analyze data from DICOM headers. Liu et al. [13] have
developed a quality assurance program that evaluates radiology
interpretations based on reference standards. They create performance
metrics for their evaluation and compare them against other
benchmarks.


3 DATA INGESTION PIPELINE IN THE MIP 


3.1 MIP architecture overview 


A high-level overview of the MIP reveals four basic layers. Following
a top-down approach, the first layer is the MIP Portal, which provides
the medical researcher with a set of tools for analysis over the
hospital data that has been mapped to the CDEs. The second layer is
the Federation layer, which is responsible for taking the requests for
analysis from the Portal and executing them over multiple hospitals.
Moreover, this layer is responsible for retrieving the output of the
analysis from each hospital and aggregating the result that will
ultimately be returned to the Portal.
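
As a rough illustration of the aggregation role of the Federation layer, the sketch below pools a mean from per-hospital partial results, so that only a count and a sum leave each hospital. The function names local_summary and federated_mean are hypothetical and are not taken from the MIP code base; the analyses actually offered by the platform are not described here and are certainly richer than this.

```python
from typing import Iterable, List, Optional, Tuple


def local_summary(values: Iterable[Optional[float]]) -> Tuple[int, float]:
    """Runs inside one hospital: return (count, sum) of the non-null values."""
    present = [v for v in values if v is not None]
    return len(present), float(sum(present))


def federated_mean(summaries: List[Tuple[int, float]]) -> Optional[float]:
    """Runs in the federation layer: combine the per-hospital summaries."""
    total_n = sum(n for n, _ in summaries)
    if total_n == 0:
        return None  # no hospital contributed a value
    return sum(s for _, s in summaries) / total_n


# Two hypothetical hospitals contribute partial results for one variable.
hospital_a = local_summary([24.0, 28.0, None])
hospital_b = local_summary([30.0])
pooled = federated_mean([hospital_a, hospital_b])
```

Exchanging counts and sums rather than records is one simple way to keep raw values local; it is shown here only to make the division of work between the Local and Federation layers concrete.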

The third layer is the Local layer, which is located in the
facilities of each hospital and runs analysis only on local data.
Finally, at the bottom of the architecture lies MIP's Data Factory,
which is responsible for providing the platform with data. Our
contribution focuses on the procedures run by the Data Factory,
which we describe in depth in the next section.


3.2 MIP Data Factory 


In Figure 1 we present the architecture of the Data Factory. On
the far left side we can see the hospital, which collects all the
available information from various sources (Inbound Information
Systems, REDCap⁸) and provides three types of data. The first is Brain
Scans, which are produced by Magnetic Resonance Imaging (MRI) machines
and are provided in accordance with the DICOM⁹ [14] integration
standard. The second type is the Electronic Health Records (EHR),
which are essentially all the information a hospital collects from its
patients. EHR may include, but are not limited to, demographics,
medical history, medication and allergies, immunization status and
laboratory test results. These records cover a vast range and differ
massively from hospital to hospital. Some of the reasons behind this
are that the hospitals may differ in the equipment that they have
available, the laws and legislation that apply in the region and the
information systems that are used for storage. Finally, the last type
of data is the meta-data, which describe the schema of the EHR. The
meta-data are provided in a standardized format that was created for
the needs of the project. Although the data will be accessed within
the environment that is provided by the hospitals, an additional level
of security is employed through the process of Pseudonymization.


⁸ https://www.project-redcap.org/
⁹ https://www.dicomstandard.org/

Figure 1: The Data Factory of the Medical Informatics Platform.


The hospital personnel decide, in accordance with each hospital's
policy, which record columns should be eliminated or replaced with
other values (e.g. patient name and patient id) so that the data is
depersonalized.
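
A minimal sketch of such a column-level pseudonymization step is given below. It is an illustration only: the text does not state how replacement values are produced, so the keyed hash (HMAC-SHA256 truncated to 16 hexadecimal characters), the function name pseudonymize and the column names are all assumptions.

```python
import hashlib
import hmac
from typing import Dict, Iterable, List


def pseudonymize(rows: List[Dict[str, str]],
                 drop: Iterable[str],
                 replace: Iterable[str],
                 secret: str) -> List[Dict[str, str]]:
    """Return copies of the rows with some columns dropped or replaced.

    Columns in `drop` are removed; columns in `replace` are substituted by a
    keyed hash, so the same input value always yields the same pseudonym.
    """
    drop_set, replace_set = set(drop), set(replace)
    key = secret.encode("utf-8")
    result = []
    for row in rows:
        clean = {}
        for column, value in row.items():
            if column in drop_set:
                continue  # eliminated according to hospital policy
            if column in replace_set:
                digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256)
                clean[column] = digest.hexdigest()[:16]
            else:
                clean[column] = value
        result.append(clean)
    return result


# Hypothetical record: the name is dropped, the id is replaced.
records = [{"patient_id": "P001", "patient_name": "Jane Doe", "age": "71"}]
safe = pseudonymize(records, drop=["patient_name"],
                    replace=["patient_id"], secret="hospital-secret")
```

Keeping the pseudonym deterministic preserves the ability to link a patient's EHR rows with the corresponding brain scans later in the pipeline, which is why a keyed hash is used in this sketch instead of a random value.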

The brain scans are passed through a pipeline that extracts brain
morphometric features. For the EHR, the initial step involves a
schema matching process to fit the data into the tables that we have
defined in the first Postgres database (unharmonized). All such
processes are implemented using MIPMap. MIPMap offers a GUI that
allows the user to define correspondences between the source and the
target schema, as well as join conditions, possible constraints and
functional transformations. Users can simply draw arrow lines between
the elements of two tree-form representations of the schema. The
techniques employed for automated or semi-automated schema matching
are prone to mistakes that, due to the nature of the biomedical field,
we cannot afford. We are processing sensitive information and we have
to guarantee its integrity. After the mapping, the EHR are linked with
the Brain Scans and are ready for the harmonization process. The
harmonization process involves mapping the variables of the first
Postgres database to the CDEs while processing their values so that
they conform to the CDEs' standards and measurement units. The final
harmonized data is saved in another database. This harmonized database
feeds the MIP local node with data. These nodes combine the CDEs with
their schema, which is offered by the MDMT, to provide data for
analysis to the upper layers of the MIP. The local node differs from
the federated one in that it is not included in the federated
analysis. Depending on their access rights, users can access the
local node and run analysis only on the data of a specific hospital.
The tools and algorithms a user can use in such an analysis are
different from the ones available for federated analysis. For security
purposes, an anonymization process is applied to the records that
will be included in the federated analysis.

In this pipeline, in order to ensure the quality of the given data,
the QCT is deployed iteratively both on the initial EHR data and
brain scans and on the final harmonized database. The reports it
produces are used to decide whether to correct or delete certain
variables. The reports are saved and can be referenced through the
MDMT. The MDMT is responsible for taking the initial meta-data and
producing a final JSON that contains a harmonized schema of the
variables.


4 META-DATA MANAGEMENT 
4.1 The importance of Meta-Data 


The most common description of meta-data is "data about the
data". We could more accurately depict it as the information required
to contextualize and understand a specific data element. In this paper
we focus on meta-data describing clinical data. Nowadays hospitals and
medical facilities collect a vast amount of information by using a
variety of techniques and tools. In order to have an accurate
depiction of this information, meta-data is essential. Meta-data is
utilized in the vast majority of medical surveys and one could argue
that it is as important as the actual data. The case becomes even
stronger when considering the need to combine data from multiple
sources for distributed analysis. A number of hospitals all over
Europe contribute to the HBP and, as expected, there are more than a
few differences in which variables and meta-data each one collects and
in which format. Thus, we need a standardized way of collecting
meta-data, in order not only to understand the given variables, but
also to be able to compare them efficiently. One of the main tasks of
the MIP is mapping hospital data to a global schema. The task is
inaccurate and time-consuming without a good understanding of the
clinical data. The margins for error are slim and we need to be able
to guarantee high quality standards. As an added benefit, the
meta-data act as our first line of defence against any inaccuracies in
the given variables. A simple comparison of the value of the data with
the given meta-data specifications can prevent a number of mistakes
from propagating to later stages of the data pipeline.
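
The comparison of a value against its meta-data specification could look like the following sketch. The keys type, values and canBeNull follow the meta-data set collected by the MDMT (Section 4.2); how a numeric range is encoded is not specified in the text, so the range tuple, the function name check_value and the returned messages are assumptions made for illustration.

```python
from typing import List, Optional


def check_value(value: Optional[str], meta: dict) -> List[str]:
    """Compare one raw value with its meta-data and list the problems found."""
    if value is None or value.strip() == "":
        # Empty cells are acceptable only when the variable may be null.
        return [] if meta.get("canBeNull", True) else ["missing value"]
    if meta["type"] == "numerical":
        try:
            number = float(value)
        except ValueError:
            return ["not a number"]
        low, high = meta.get("range", (None, None))  # assumed encoding
        if low is not None and number < low:
            return ["below minimum"]
        if high is not None and number > high:
            return ["above maximum"]
    elif meta["type"] == "nominal":
        allowed = meta.get("values")
        if allowed is not None and value not in allowed:
            return ["unknown category"]
    return []


# Illustrative specification: a 30-point score that must always be present.
mmse_meta = {"type": "numerical", "range": (0, 30), "canBeNull": False}
problems = check_value("31", mmse_meta)
```

A check of this kind is cheap to run on every incoming value and flags out-of-range or mistyped entries before any mapping takes place.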


4.2 Meta-Data Management Tool 


General Description: The MDMT is an end-to-end platform, created
with the latest technologies and tasked with versioning the
meta-data of the hospital variables and the CDEs. It utilizes a global
schema for the collection of the meta-data and provides information
about the mappings of the variables to CDEs. The potential users
of the tool are, first of all, the researchers who, prior to executing
experiments through the Portal of the MIP, may want to investigate
what type of information is available in each hospital. Moreover,
the tool aims at facilitating the collaboration between the authorized
hospital personnel and the development team of the MIP. Each side can
implement changes by creating a new version, which will be reviewed by
the other side; through many iterations, this process will lead to
high quality information. As a result we will be able to make more
accurate mappings of variables to CDEs and resolve cases where we did
not have enough information to make a mapping. The data pipeline of
the MIP is quite large and more than one team is involved in each
process, so when a change is made everyone has to be informed. The
tool is intended to be a single point of reference and should
eliminate mistakes originating from the lack of transparency in the
changes implemented in the variables' schema and meta-data.

Changes in the schema of clinical data are quite common and we 
can distinguish three types of them: 


(1) Correction changes due to mistakes or to insufficient infor- 
mation in the meta-data. 

(2) Changes due to information coming from a new data source 
within the hospital. 

(3) Changes due to updates of the CDEs, after which new mappings
may become possible or old ones may have to be modified.


At any given time we can download the meta-data for the MIP's
global schema as well as the meta-data for every hospital's schema
of imported (or to-be-imported) data. All these are stored in JSON
files. These JSON meta-data files are used as input to the MIP when
importing new data, since the platform needs the corresponding
meta-data along with the data.


Meta-Data Information: The process of tracking the meta-data is
initiated by each hospital, which has to upload, in a specified format,
an Excel file containing all the meta-data. We have defined a set of
meta-data variables for the hospitals in order to avoid confusion
and to speed up the process of uploading them. This set contains
the following information:


• csvFile: The name of the dataset file that contains the variable.
• name: The name of the variable.
• code: The variable's code.
• type: The variable's type.
• values: The variable's values. It may have an enumeration or a
  range of values.
• unit: The variable's measurement unit.
• canBeNull: Whether the variable is allowed to be null or not.
• description: The variable's description.
• comments: Comments about the variable's semantics.
• conceptPath: The variable's concept path.
• methodology: The methodology the variable has come from.
• mapFunction: The function that transforms the variable's value
  into the value of its corresponding CDE.
• mapCDE: The corresponding CDE.


This information covers the vast majority of the meta-data we want
to collect and is easy to adopt even by hospitals that have not
implemented a standard for their meta-data. Most of the columns
are self-explanatory, but we give further information about the
conceptPath, the mapFunction and the mapCDE. The conceptPath
essentially defines the hierarchy of the variable. It has a strictly
defined format and always starts with the root category (/root)
followed by the rest of the consecutive categories, separated by
slashes (/), until we reach the variable code. For example, the
conceptPath for the Mini Mental State Examination¹⁰ score is
/root/neuropsychology/minimentalstate. The mapCDE defines which of
the CDEs the current variable should be matched to, and the rule that
should be applied to its value for the transformation is expressed by
mapFunction. The tool keeps the information about the mappings but it
does not perform the actual mapping transformations (this is done by
MIPMap). After the provided file is checked for its integrity, a new
hospital entity is created containing the first version of the
variables' meta-data and mapping.
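
The conceptPath format described above can be checked with a few lines of code. The sketch below is an illustration under stated assumptions: the function name parse_concept_path is hypothetical, and it assumes that the last path segment equals the variable's code, as the example path suggests.

```python
from typing import List


def parse_concept_path(concept_path: str, code: str) -> List[str]:
    """Validate a conceptPath and return its categories, root first.

    The path must start with /root, contain no empty segment and end
    with the code of the variable it describes.
    """
    parts = concept_path.split("/")
    # A leading slash yields an empty first element after splitting.
    if len(parts) < 3 or parts[0] != "" or parts[1] != "root":
        raise ValueError("conceptPath must start with /root")
    if any(part == "" for part in parts[1:]):
        raise ValueError("conceptPath contains an empty category")
    if parts[-1] != code:
        raise ValueError("conceptPath must end with the variable code")
    return parts[1:-1]


categories = parse_concept_path(
    "/root/neuropsychology/minimentalstate", "minimentalstate")
```

For the example path the returned categories are root and neuropsychology, which is the chain of ancestors under which the variable is placed in the taxonomy.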


Meta-Data Viewing: The user can view all versions of each hos- 
pital’s local variables. She is also able to view all CDEs’ versions. 
For every hospital’s meta-data version that is created, we also cre- 
ate a harmonized one containing CDEs and additional hospital 
local variables. Each meta-data version is depicted via four different 
views: 


¹⁰ The Mini Mental State Examination (MMSE) or Folstein test is a 30-point
questionnaire that is used extensively in clinical and research settings to
measure cognitive impairment. It is commonly used to screen for dementia.

Figure 2: The searchable graph displaying the taxonomy of the variables and CDEs.


(1) Flat View: A flat view includes all the variables and their
respective meta-data. The variables are searchable by their
code and category. This view is convenient for checking
variables' details.

(2) Tree View: An interactive tree view (Figure 2) displaying 
the taxonomy of the variables. The user can expand or col- 
lapse nodes and every time she hovers over a node a brief 
description of the node is displayed. The tree is also search- 
able by variable code and category. This view is good for 
a better comprehension of the clinical variables' categories 
and hierarchical semantic structure. 

(3) Mapping Visual: For each variable that has been mapped
to a CDE, we offer a comprehensive graphical representation
(Figure 3) displaying their link and the rule by which the
transformation is made. This view will later be used as a
reference to make the actual mapping transformations.

(4) Quality Control Tool Report: For each batch of data we
receive from a hospital, the QCT produces a report about the
whole batch, as well as a report for each of the variables. Both
reports are displayed in a table-like structure. In this view,
we can select each variable to view its report or download it
(csv format) for further analysis.


We have to note that for the CDEs only the first two views (flat
and tree view) are offered. There is no mapping visual, since the
CDEs are the schema to map to, and we do not produce a quality
report for them, because there are no pure CDE datasets. The tree
view is an essential component in providing insight into which CDE
a hospital's local variable should be mapped to. A user who knows all
the current categories and their hierarchy can isolate the category
that is of interest to her and either find a CDE that the variable
can be mapped to or, if no CDE is semantically equivalent, place the
variable in a more appropriate category by creating a new leaf node.
Furthermore, there are some cases of variables belonging to categories
that are not already contained in the hierarchy of the CDEs. In this
case a new category (intermediate node) will be created for the new
version and displayed in the tree. The pertinent Data Governance
committee, composed of clinicians and researchers, periodically makes
decisions on the maintenance and enrichment of the CDEs. When new
clinical variables and categories emerge, they are considered
candidates for the CDE stamp in the next version. To be approved and
become part of the CDEs' global schema, however, they must be
sufficiently common across hospitals and have a scientifically clear
and complete definition along with a range or enumeration of values.
Lastly, they must be considered to have significant potential for
contributing to the results of the MIP analyses.


Meta-Data Management: There are two ways a user can create a 
new version: 


(1) Uploading Version File: The user can upload a file (.xlsx)
in a specified format containing the meta-data for all the
variables. If the file passes the integrity checks, a new version
will be created. Prior to the procedure we also offer an
empty template file containing all the appropriate columns
and give detailed instructions on how to complete each one
of them. This method is suitable for uploading large amounts
of information, mainly for the creation of the first versions
of the variables.

When parsing a file for the creation of a new meta-data version,
the DC generates a hierarchical tree out of a flat input, which,
although flat, has the clinical variables' taxonomy expressed via
their conceptPaths. Our system correctly parses the variables of
any kind of hierarchy. Every node in the taxonomy may be a category
or a leaf (variable). Each row of the file contains the description
of either a variable or a category. It is obligatory to give a
definition for each different variable, whereas for the categories
it is optional. If a category is not defined, we can still create it
as long as it is referenced in the hierarchy of its children. The
order in which the variables or the categories are given in the file
is irrelevant, i.e., children may precede their parents or the other
way around. Our parsing algorithm, which uses recursive DFS
traversal, will produce the correct tree as long as the given
conceptPaths do not contain errors. Procedure ADDNODE (Algorithm 1)
is executed for all rows of the input flat file to create an
in-memory hierarchical structure of the variables. This variables'
tree is serialized for the purposes of meta-data tree visualization
as well as data importing, as already stated.

(2) New Version GUI: We also offer a GUI for version creation.
The GUI displays all the variable information of the previous
version and gives the ability to delete or change it, as well as
to add information about a completely new variable. After all the
appropriate changes are made, the version is submitted to be
saved. Provided that all integrity tests pass, a new version is
created. The GUI offers a good and easy way to make minor changes
to the already existing version. By constantly making minor
corrections to the last version we gradually create higher
quality information.


Authorization: All types of meta-data searching and viewing are
offered to all users without having to log in to the platform. To be
authorized to create a new version for hospital data or CDEs, a user
has to log in and also have the required rights, which are provided
in a centralized way by the Medical Informatics Platform.

Algorithm 1 Creating the variables' tree

procedure ADDNODE(row, root)    ▷ add row's variable/category to root's Tree
    concept[] ← split(row.conceptPath, "/")    ▷ split conceptPath into its parts and store them into an array
    for each concept[i] do
        DFS on root for Node having concept[i]
        if Node is in root's Tree then
            if concept[i] is the last part of row.conceptPath then
                Update the existing Node with row's metadata
                ▷ if a node for row already exists it should have been created when
                  processing one of its children, since the current row is the one
                  dedicated to it, having all the element's information
        else
            DFS on root for the parent of the Node to add
            if concept[i] is the last part of row.conceptPath then
                Create the Node for row
            else
                Create a Node with the up until now conceptPath
                ▷ even though the current row is not dedicated to this element we
                  will create a node for it now; after all, the user is not obligated
                  to give meta-data for the intermediate elements of the tree
            Add the new Node
        root ← the Node for row    ▷ so as not to search from the root of the tree in the next concept[i] iteration

procedure CREATETREE(inputFile)
    Create the root of the Tree
    for each row in inputFile do
        ADDNODE(row, root)
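
A compact way to experiment with the idea behind Algorithm 1 is sketched below. It is not the MDMT implementation (which is written in Java): instead of running a DFS for every path segment, this variant walks down from the root one segment at a time using a dictionary of children, which yields the same tree under the assumption that every conceptPath starts at /root. The names Node, add_node and create_tree are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class Node:
    code: str
    metadata: Optional[dict] = None  # stays None until the node's own row is seen
    children: Dict[str, "Node"] = field(default_factory=dict)


def add_node(row: dict, root: Node) -> None:
    """Insert the variable or category described by `row` into the tree."""
    parts = [p for p in row["conceptPath"].split("/") if p]
    if not parts or parts[0] != root.code:
        raise ValueError("conceptPath must start with /" + root.code)
    current = root
    for part in parts[1:]:
        # Intermediate categories without a row of their own get a placeholder;
        # an existing node (created earlier by a child or by its own row) is reused.
        current = current.children.setdefault(part, Node(code=part))
    # The last segment is the element this row is dedicated to.
    current.metadata = row


def create_tree(rows: Iterable[dict]) -> Node:
    root = Node(code="root")
    for row in rows:
        add_node(row, root)
    return root


# The child row deliberately precedes the row of its parent category.
example_rows = [
    {"conceptPath": "/root/neuropsychology/minimentalstate", "code": "minimentalstate"},
    {"conceptPath": "/root/neuropsychology", "code": "neuropsychology"},
]
tree = create_tree(example_rows)
```

Because placeholders are filled in when their own row arrives, the order of the rows does not matter, and categories that never receive a row simply keep empty meta-data.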


Technologies Used: The MDMT is temporarily hosted¹¹ within the
premises of Athens University of Economics and Business (AUEB) and
its source code is available on GitHub¹². The system has been built
with the following technologies:

• Java and Spring Boot for the server side application logic.
• PostgreSQL for data storage.
• Angular and TypeScript for the client side User Interface (UI).
• D3 for data visualizations.
• Git for source code version control.


5 QUALITY CONTROL TOOL 


General Description: The main functionality of the QCT is to
ensure a good level of data quality by producing a report (in csv and
pdf) describing an incoming hospital dataset before it is inserted
into the MIP and after it has been inserted into the harmonized
database. At this stage of development, the tool can process datasets
in the form of tabular data (csv) and DICOM datasets containing
MRI sequences. The QCT produces a different report for each kind of
dataset, which is meant to be evaluated by a human. If the dataset
meets the minimum specifications, it is inserted into the MIP.


5.1 Tabular Data 


¹¹ http://195.251.252.222:2442/hospitals
¹² https://github.com/HBPMedical/DataCatalogue

Figure 3: The graphical representation of the variables' mapping to CDEs.


In the case of a tabular dataset, the produced report includes a
set of statistics profiling the missing values across the rows (per
observation) and columns (per variable). Moreover, a set of
statistics is calculated per variable depending on the variable type.
Currently, the MIP local layer has three types of variables:
numerical, nominal and text. For each numerical variable the QCT
calculates a set of descriptive statistics: mean, standard deviation,
minimum, maximum, and 1st, 2nd and 3rd quartiles. In addition, based
on those measurements, the QCT estimates the number of rows with
possible outliers per numerical variable. The report for a nominal
(categorical) variable provides information about the number of
categories and their labels, the most frequent category and the
number of occurrences of the latter. Likewise, for a text variable
the report provides the same information about the most frequent
value, but also includes the 5 most and 5 least frequent values of
the text variable. All in all, the statistics produced for the
variables are the following:


• Variable Name: The column name in the given dataset.
• Type Declared: The data type of the variable as declared in
  the file.
• Type Estimated: The data type of the variable as estimated
  by the QCT.
• List of Category Values (nominal variables): A string
  with the category values of the variable (enclosed in single
  quotes and separated by commas).
• Number of Category Values (nominal variables): The
  total number of the category values of the variable.
• Count of Unique Values (text variables): The count of
  unique string values.
• Most Frequent Value (text and nominal variables): The
  most frequent value of the variable.
• Number of Occurrences for Most Frequent Value (text
  and nominal variables): The number of occurrences of the
  most frequent value of the variable.
• Count of Records Filled In: The number of rows that are
  filled.
• Percentage of Non Null Rows: The percentage of filled
  rows.
• Mean Value (numerical variables): The average value of the
  variable.
• Standard Deviation (numerical variables): The standard
  deviation.
• Minimum Value (numerical variables): The minimum value
  of the variable.
• Maximum Value (numerical variables): The maximum
  value of the variable.
• First Quantile (numerical variables): The first quartile, i.e.
  the value of the variable below which 25% of the records fall.
• Median (numerical variables): The value of the variable
  below which 50% of the records fall.
• Third Quantile (numerical variables): The third quartile, i.e.
  the value of the variable below which 75% of the records fall.
• Number of Outliers (numerical variables): The number of
  values of the variable that lie more than three standard
  deviations from the mean value.
• The 5 Least Frequent Values (text variables): A string
  with the 5 least frequent values of the variable separated by
  commas.
• The 5 Most Frequent Values (text variables): A string
  with the 5 most frequent values of the variable separated by
  commas.
• Comments: A string with various messages about the
  values of the variable.
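
The per-variable statistics listed above can be approximated with the Python standard library, as in the sketch below. It is an illustration and not the QCT code (which relies on Numpy and Pandas): the function names and result keys are hypothetical, and the use of the sample standard deviation and of the default quartile method of statistics.quantiles are assumptions, since the text does not specify either.

```python
import statistics
from collections import Counter
from typing import Dict, List, Optional


def profile_numerical(values: List[Optional[float]]) -> Dict[str, float]:
    """Descriptive statistics and a three-sigma outlier count for one column."""
    filled = [v for v in values if v is not None]
    if len(filled) < 2:
        raise ValueError("at least two filled values are needed")
    mean = statistics.mean(filled)
    std = statistics.stdev(filled)  # sample standard deviation (assumption)
    q1, median, q3 = statistics.quantiles(filled, n=4)
    return {
        "count_filled": len(filled),
        "percent_non_null": 100.0 * len(filled) / len(values),
        "mean": mean,
        "std": std,
        "min": min(filled),
        "max": max(filled),
        "q1": q1,
        "median": median,
        "q3": q3,
        # values lying more than three standard deviations from the mean
        "outliers": sum(1 for v in filled if abs(v - mean) > 3 * std),
    }


def profile_nominal(values: List[Optional[str]]) -> Dict[str, object]:
    """Category list and most frequent category for one nominal column."""
    filled = [v for v in values if v is not None and v != ""]
    if not filled:
        raise ValueError("no filled values")
    top_value, top_count = Counter(filled).most_common(1)[0]
    return {
        "categories": sorted(set(filled)),
        "category_count": len(set(filled)),
        "most_frequent": top_value,
        "most_frequent_count": top_count,
    }


numeric_report = profile_numerical([1.0, 2.0, 3.0, 4.0, 5.0, None])
nominal_report = profile_nominal(["a", "b", "a", None])
```

Note that a three-sigma rule only flags values once enough observations are present; with very small columns the standard deviation is dominated by the extreme value itself and nothing is reported.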


5.2 Imaging Data 


In the case of a DICOM dataset, the QCT can recognize the MRI
sequences in a given folder and extract their meta-data (headers).
Based on this meta-data, the QCT performs a validation of the MRI
sequences and exports the results in a report. The HBP MIP has some
specific minimum requirements that every DICOM sequence must meet in
order to be inserted into the platform. The requirements that each
image should meet are the following:


(1) The images must be full brain scans.

(2) The images must be provided either in DICOM or NIFTI
format.

(3) The images must be high-resolution (max. 1.5 mm) T1-weighted
sagittal images.

(4) The images must contain at least 40 slices.


If the imaging file is not readable or the requirements are not met,
the tool produces a report detailing the reasons the file is rejected.
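
The four requirements translate naturally into a list of rejection reasons. The sketch below operates on a small summary record rather than on real DICOM headers; the class SequenceSummary, its field names and the reading of "max. 1.5 mm" as an upper bound on the voxel size are assumptions made for illustration, not the QCT's actual data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SequenceSummary:
    """Hypothetical summary of one MRI sequence, filled from its headers."""
    readable: bool
    file_format: str      # e.g. "DICOM" or "NIFTI"
    weighting: str        # e.g. "T1"
    orientation: str      # e.g. "sagittal"
    voxel_size_mm: float  # largest voxel edge, in millimetres
    slice_count: int
    full_brain: bool


def rejection_reasons(seq: SequenceSummary) -> List[str]:
    """Return the reasons a sequence fails the minimum requirements."""
    if not seq.readable:
        return ["file is not readable"]
    reasons = []
    if not seq.full_brain:
        reasons.append("not a full brain scan")
    if seq.file_format.upper() not in {"DICOM", "NIFTI"}:
        reasons.append("format is neither DICOM nor NIFTI")
    if seq.weighting.upper() != "T1":
        reasons.append("not T1-weighted")
    if seq.orientation.lower() != "sagittal":
        reasons.append("not sagittal")
    if seq.voxel_size_mm > 1.5:
        reasons.append("resolution coarser than 1.5 mm")
    if seq.slice_count < 40:
        reasons.append("fewer than 40 slices")
    return reasons


good = SequenceSummary(True, "DICOM", "T1", "sagittal", 1.0, 176, True)
bad = SequenceSummary(True, "DICOM", "T2", "sagittal", 2.0, 30, True)
good_reasons = rejection_reasons(good)
bad_reasons = rejection_reasons(bad)
```

An empty list means the sequence can be accepted; otherwise the collected reasons can be written directly into the rejection report.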


5.3 Technologies Used 


The QCT is deployed locally within the hospitals and utilizes
the following technologies and libraries:

• Python 3.5 for the development of the main framework.
• Numpy and Pandas to handle the data and produce statistics.
• Pydicom to read DICOM files.
• Tkinter to create the User Interface (UI).


6 FUTURE WORK 


Currently we are working on improving the Data Factory pipeline and
making it more robust, so as to guarantee hospital data harmonization
and ingestion into the MIP with the least possible human intervention
along with high data quality. To that end we are designing a
complementary Quality Control Tool that will give recommendations for
value corrections that can be useful to hospitals' personnel.

To provide the MIP user with more than numeric statistical results,
we plan on incorporating some of LORIS' image visualization and
quality control features so that clinicians can view the collected
brain scans in 3D. Because brain scans are highly sensitive data,
this feature will, in order to comply with all privacy rules, be
accessible only in MIP Local by a few authorised users. Regulations
that forbid exposing to the federation raw data which can be used to
re-identify the data (link the information to the actual patient) are
an obstacle to giving clinicians access to a large corpus of actual
brain scans collected from different hospitals.


7 CONCLUSION 


The data pipeline of the MIP is complex and demands high quality
guarantees due to the sensitive nature of the information. The schema
of the variables and the CDEs is constantly changing due to the
ever-growing need to capture more information and due to corrections.
Understanding the hospital variables is essential for the process of
mapping them to CDEs. We present an end-to-end tool for meta-data
management that offers an efficient way of tracking changes in both
the meta-data and the schema of the hospital variables, and provides
information that is necessary for the mapping tasks. The tool acts as
a single point of reference and utilizes graphical representations in
order to make the data more comprehensible. Moreover, our QCT produces
quality reports for both tabular and imaging data that can be reviewed
by a researcher in order to accept, correct or eliminate records.


ABSTRACT


Distributed ledgers allow us to replicate databases of records
across mutually untrusted parties. The best known example of a
distributed ledger is perhaps the Bitcoin blockchain, which
maintains a consistent history of financial transactions
organized as a hashed chain of blocks. Distributed ledgers can
be public, i.e., accessible by everyone, or private, i.e.,
accessible only by a given consortium of parties. In this paper,
we explore the technological possibilities of applying
Identity-Based Encryption and Attribute-Based Encryption to
distributed ledgers. We introduce the novel concept of Virtual
Private Ledger. A Virtual Private Ledger is a private distributed
ledger embedded in a public cryptocurrency ledger by means of
cryptography. A Virtual Private Ledger provides the same
confidentiality and integrity as a private distributed ledger,
but without its high operational costs. In particular, nodes that
maintain the ledger do not have to be always online to trust the
order and the integrity of the records. We analytically show that
Virtual Private Ledgers can be implemented over many existing
cryptocurrency ledgers like Ethereum, EOS.IO, IOTA, XRP. Different
cryptocurrencies lead to different trade-offs between the Virtual
Private Ledger max record size, cost, validation time, and max
consortium members.


CCS CONCEPTS 


• Networks → Peer-to-peer networks; • Security and privacy → Network security;
1 INTRODUCTION


In the last years distributed ledgers have been object of many 
studies in different IT sectors, such as smart home [13], smart 
grid [19], healthcare [20], smart city [3], and so on. In general, 
a distributed ledger is useful when multiple peers, possibly 
having conflicting interests, want to agree on a shared history 
of records. It allows the peers to trust record order and con- 
sistency without trusting any other peer in the network [25]. 
Distributed ledgers can be public or private. In private dis- 
tributed ledgers, records are available only to a restricted 
group of authorized parties, called a consortium. This is useful
when records carry privacy-sensitive or business-critical
information. 

One problem of employing distributed ledgers on a vast 
scale is that they require the peers to maintain a complete 
local copy of the entire ledger [23]. Depending on the specific 
application, such a ledger can be quite large. For example in 
Internet of Things applications, data is produced not only by 
human beings but also by "things". The number of Internet of 
Things devices is exponentially increasing and it is expected 
to reach more than 20 billion devices by 2020 [15]. It is also 
estimated that nearly 850 Zbytes will be generated by all 
people and things by 2021 [11]. Maintaining copies of such 
data in all the peers involved in a distributed ledger could 
be prohibitively costly. Moreover, peers must be constantly
online to run the consensus protocol for each block of data
to be recorded in the ledger.

The inspiration for this paper comes from Virtual Private
Networks (VPNs), which allow us to build private networks
over insecure public ones by means of cryptography. Analogously,
we explore the possibility of implementing a Virtual
Private Ledger (VPL) by embedding it in an existing cryp- 
tocurrency ledger by means of cryptography. Cryptocurren- 
cies often organize their distributed ledger as a blockchain, 
which is a recent and special kind of distributed ledger that 
maintains a consistent history of records organized as a 
hashed chain of blocks. Blockchain technology has gained
great momentum in recent years due to its application in the
widespread Bitcoin cryptocurrency [21]. A VPL provides
the same confidentiality and integrity as a private distributed
ledger, but without its high operational costs. In particular,
peers do not have to be always online to trust the order and 
the integrity of the records. They do not have to execute the 
cryptocurrency consensus algorithm or to maintain a local 
copy of the whole ledger. The blockchain peers will do it
for them as long as they are incentivized to maintain the
consistency of the cryptocurrency transactions. We also give
a mechanism by which peers can reach a consensus on the 
semantic validity of the records, and not only their order and 
integrity. For example, if the private ledger contains GPS
traces of the customers of a car insurance company, such traces
can be checked for consistency with other measurements, for
example those coming from electronic toll collection systems
on highways.

The rest of the paper is organized as follows. Section 2 
introduces the main technological aspects of distributed 
ledgers and blockchains. In Section 3 we review some rele- 
vant related work. Section 4 introduces the Virtual Private 
Ledger concept, and our reference threat model. Section 5 
analytically investigates the feasibility of embedding a VPL 
in existing cryptocurrency blockchains. Finally, Section 6 
concludes the paper. 


2 PRELIMINARIES 


A distributed ledger is a replicated database of records shared 
across a network of multiple sites, geographies or institutions [27].
A distributed ledger is maintained in a distributed
fashion, and it does not need central administration or
centralized data storage. All participants have their own
identical copy of the ledger. Any changes to the ledger are
reflected in all copies in minutes, or in some cases, seconds.
In order to reach agreement on the ledger status, a consensus
protocol is needed. Different consensus protocols have been
proposed in the literature, with different security, scalability and
timing properties [29]. Distributed ledgers can be either public
or private. Public distributed ledgers are publicly accessible, 
so that anyone can download and read the entire record his- 
tory. Therefore, it is not secure to use a public distributed 
ledger to carry privacy-sensitive or business-critical informa- 
tion. In contrast, in private distributed ledgers the access to 
records is restricted to the members of a consortium, who
are authorized by a central authority. A consortium member
can be a client or a peer. The clients simply read and write
records on the ledger. The peers are also in charge of executing
the consensus protocol. Examples of private distributed
ledgers are Hyperledger Indy [1] and Tendermint [9]. Private
distributed ledgers typically use variations of the Practical
Byzantine Fault Tolerance protocol [26] as a consensus protocol.
A private distributed ledger can securely carry personal
and critical information, since only the authorized entities
can access it. However, its operational cost is generally
quite high, since all the peers have to maintain the
entire record database.


Figure 1: Typical structure of a blockchain

In recent years, distributed ledger technology has gained
great interest in both academia and industry thanks to the
introduction of blockchain technology. A blockchain is a
tamper-proof distributed ledger, whose typical
structure is shown in Fig. 1. It is a list of ordered blocks,
where each block stores a variable-size list of records. Each
block is chained to the previous one by including the previous
block's hash value. The blockchain is maintained in a distributed
fashion by a set of peers, which participate in the consensus
protocol. The records not yet included in a block are collected
by peers, which gradually fill up a new block that may be dif- 
ferent for every peer. Typically, when this new block reaches 
a predefined maximum size or when a predefined timer ex- 
pires, a distributed consensus protocol can start. As a result 
of the consensus protocol, one peer is elected as temporary 
central peer, and it decides the next block to be added to the 
blockchain. The temporary central peer signs the block and 
broadcasts it to all peers so that they can verify that the block 
was built from valid records and possibly append it to their 
locally maintained blockchain. The block header included 
in this block conveys all the information needed to verify 
the correctness of the executed consensus protocol. Every 
blockchain starts with a special block, called genesis, which 
does not reference a previous block and must be known a 
priori by all the peers. The structure of the blockchain guarantees
that any change to the order or the content of the
records in a block would entirely change the successive blocks
of the blockchain. For this reason, changing something in
the blockchain would require running several instances of the
consensus protocol again, which is infeasible in practice.
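
The chaining property described above can be illustrated with a minimal sketch. It is not the block format of any real cryptocurrency: the dictionary layout, the use of SHA-256 over a JSON encoding, and the helper names `block_hash`, `append_block` and `is_consistent` are all assumptions made for illustration.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    """Hash a block via a canonical JSON encoding (illustrative choice)."""
    encoded = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_block(chain: list, records: list) -> None:
    """Append a block that stores the hash of the previous block."""
    prev = block_hash(chain[-1]) if chain else None  # genesis has no parent
    chain.append({"prev_hash": prev, "records": list(records)})


def is_consistent(chain: list) -> bool:
    """Check that every block references the hash of its predecessor."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


chain = []
append_block(chain, ["genesis"])
append_block(chain, ["record-1", "record-2"])
append_block(chain, ["record-3"])
print(is_consistent(chain))          # chain as built

chain[1]["records"][0] = "tampered"  # alter a record in an old block
print(is_consistent(chain))          # the next block's link no longer matches
```

Changing a record in an earlier block changes that block's hash, so the link stored in the following block no longer matches; repairing it would require rebuilding every successive block.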

Historically, the first consensus protocol proposed for
blockchains was the Proof of Work (PoW) protocol [21],
used for example in Bitcoin. The PoW protocol is based on
finding a solution to a hard mathematical problem
(puzzle). The puzzle must be hard to solve, but it must be easy
to verify that a solution is correct. A typical puzzle, used in
Bitcoin and many other cryptocurrencies, is finding a quantity
to include inside the block such that the block's hash is
below a predefined target. The peer who first finds a solution
automatically becomes the temporary central peer, and it
decides the next block to be added to the blockchain. In order
to incentivize peers to spend computational resources on
solving puzzles, PoW protocols typically reward
the temporary central peer. For example, in Bitcoin the
temporary central peer gains a fixed quantity of Bitcoins¹.
The emerging blockchain technology also plays a relevant role
in several application scenarios such as smart homes [13],
smart grids [19], healthcare [20], and smart cities [3]. In
general, a blockchain network is useful when multiple enti- 
ties having conflicting interests want to agree on a shared 
history of records. It allows nodes to trust record order and 
integrity without trusting any node in the network [25]. 
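
The hash-below-target puzzle described above can be sketched as follows. This is a simplified illustration under stated assumptions, not Bitcoin's actual procedure: it uses a single SHA-256 over the block data concatenated with an 8-byte nonce, and the names `mine`, `verify` and `difficulty_bits` are hypothetical.

```python
import hashlib


def _digest_value(block_data: bytes, nonce: int) -> int:
    """Interpret SHA-256(block_data || nonce) as a 256-bit integer."""
    digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")


def mine(block_data: bytes, difficulty_bits: int) -> int:
    """Search for a nonce whose hash is below the target (hard to solve)."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while _digest_value(block_data, nonce) >= target:
        nonce += 1
    return nonce


def verify(block_data: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Check a claimed solution with a single hash (easy to verify)."""
    return _digest_value(block_data, nonce) < (1 << (256 - difficulty_bits))


nonce = mine(b"example block", difficulty_bits=12)
print(nonce, verify(b"example block", nonce, 12))
```

Finding the nonce takes about 2^difficulty_bits hash evaluations on average, whereas checking it takes one, which is the asymmetry the protocol relies on.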


3 RELATED WORK 


Dorri et al. [14] proposed a distributed ledger solution for 
vehicular ecosystems, based on blockchain. Ledger records
store the hash values of data generated by in-vehicle sensors,
e.g., GPS traces, brake utilization, traffic information, etc.
The actual data is instead stored in the vehicles themselves,
leveraging mass storage such as an SD card, and it is retrieved
only in case of real need, e.g., car accidents. This
allows us to guarantee more privacy over data, and at the
same time it reduces the size of the distributed ledger, thus
reducing its operational cost. However, data is not available
to nodes at all times. Moreover, it is not possible to verify
the semantic validity of data at the moment of a ledger
update, so that the ledger could contain invalid data at any
moment. For example, if the ledger contains GPS traces for a 
car insurance application, such traces could be inconsistent 
with other measurements, for example those coming from 
electronic toll collection systems on highways. Our proposal
reduces the operational cost of the private ledger by embedding
it in a public cryptocurrency blockchain. This allows
us to guarantee data availability to peers, and to enforce
semantic validity of data.

Zyskind et al. [30] combined a distributed ledger with 
off-ledger storage to construct a personal data management 
platform focused on privacy. Data is stored in an off-chain 
distributed hash-table which is enforced through an access 
control manager implemented on top of a blockchain. The 
off-blockchain storage is maintained by a network of trusted 
nodes or simply by a centralized cloud. In our proposal we do
not need off-blockchain storage, since data are encrypted
and directly included in a public blockchain. The access
control mechanism is not implemented with a blockchain
but rather by cryptography.


¹ At the time of writing, approximately 12.5 BTC.

Hyperledger Indy [1] and Tendermint [9] are private dis- 
tributed ledger technologies that use variations of the Practi- 
cal Byzantine Fault Tolerance protocol [26] to reach consen- 
sus among peers. Their operational cost is generally quite 
high, since all the peers have to maintain the entire ledger. 
Moreover, peers must be constantly online to run the consensus
protocol for each block of data to be recorded in
the ledger. With our approach, consortium members do not
have to be always online to trust the order and the integrity
of the records. They do not have to execute the consensus
algorithm or to maintain a local copy of the whole ledger.
The peers of the underlying public blockchain will do it for
them as long as they are incentivized to maintain the
consistency of the cryptocurrency transactions.


4 VIRTUAL PRIVATE LEDGER 


A Virtual Private Ledger (VPL) is a private distributed ledger 
embedded inside a public cryptocurrency blockchain by 
means of cryptography. Broadly speaking, the records of
the VPL (VPL records) are encrypted and stored inside the optional
data of the transactions of the cryptocurrency blockchain.
The majority of modern cryptocurrencies (e.g., Bitcoin, Ethereum,
etc.) support optional data to be included in every transaction.
Note that not all cryptocurrencies require
an actual money transfer in order to store a VPL record, as
some allow for transactions involving zero coins. In this way
it is possible to include a VPL record in a cryptocurrency
transaction without transferring money.


4.1 VPL Architecture 


The general architecture of a VPL is shown in Fig. 2. A VPL 
member is an entity that can access the VPL, because it 
possesses the necessary keys to decrypt and authenticate 
VPL records. The set of all the VPL members is called the VPL
consortium. A VPL member can be a VPL client or a VPL peer.
The VPL clients simply read and write records on the VPL.
VPL peers are in charge of executing the validation protocol.
The validation protocol is a consensus protocol which aims
at semantically validating the VPL records. For example, if
the ledger contains GPS traces of the customers of a car
insurance company, such traces can be checked for consistency with
other measurements in the ledger, for example those coming
from electronic toll collection systems on highways. The
validation protocol can be any kind of consensus protocol,
for example a PBFT protocol, which reaches consensus
in a short time compared to proof-of-work protocols [26]
and resists up to 33% malicious VPL peers in the consortium.


Figure 2: Architecture of a VPL


When a VPL client wants to add some data to the VPL, it
encrypts the data, thus obtaining a new VPL
record. Then, the VPL client stores the VPL record inside the
optional data of a transaction of the public cryptocurrency
blockchain. Such a transaction can also involve zero coins.
Finally, the VPL client publishes the transaction in the
public cryptocurrency blockchain, and it notifies the VPL
peers of the new VPL record. If the VPL record needs
semantic validation, then the VPL peers decrypt it and execute
the validation protocol. The VPL peers independently
check the semantic validity of the VPL record and reach a consensus
on the outcome. After the VPL peers have agreed that the VPL
record is semantically valid, they publish a VPL validation
record on the cryptocurrency blockchain, which proves its
validity to the VPL clients. Which specific peer is in charge
of publishing the VPL validation record is out of the scope of 
this paper. For example, the peers can follow a round-robin 
policy, and reach consensus on which peer must publish the 
next VPL validation record. Such a consensus can be reached 
contextually to the execution of the validation protocol. 

When a VPL client wants to read some data from the VPL,
it simply retrieves the corresponding VPL record from the public
cryptocurrency blockchain and decrypts it. Depending on
the type of encryption employed, access control rules can
be enforced on VPL records. For example, VPL clients can
read all the records or only a subset of them. If the VPL
record underwent semantic validation, then the VPL client
also retrieves the corresponding VPL validation record from the
cryptocurrency blockchain, which proves the validity of the
VPL record.


4.2 VPL Records


Figure 3: VPL record format


Fig. 3 shows the format of a VPL record. The VPL record is
produced by a VPL client and stored inside the optional data
field² of a cryptocurrency transaction. The block header and
the transaction header depend on the format of the cryp- 
tocurrency blockchain. We only require that the transaction 
header contains a signature of the whole transaction includ- 
ing the optional data, in such a way that the VPL record is 
signed by the VPL client that produced it. This is assured 
by all the major cryptocurrencies. The VPL header contains
an identifier of the VPL consortium to which the record belongs.
It also contains the identifier of the key (or keys)
able to decrypt the data. The encrypted payload is the actual
data encrypted by means of some form of cryptography.
We identified two suitable forms of cryptography,
depending on whether the consortium wants to enforce
access control on data. In a VPL without access control, all
the clients and peers are authorized to read all the records. 
The access to such records must be denied only to entities 
external to the consortium. A VPL without access control 
can be obtained by encrypting records with Identity-Based 
Encryption (IBE) [5, 7, 8, 18]. IBE is capable of encrypting a
record in such a way that only those who have a particular identity
can decrypt it afterwards. Such an identity can refer to a
single entity as well as to a group of entities. In our case, the
VPL records must be encrypted with the identity of the VPL
consortium, in such a way that all the VPL members (clients
and peers) can decrypt them. On the other hand, in a VPL
with access control the clients are authorized to read only some of
the records, following a fine-grained access control policy, whereas
the peers can access them all. A VPL with access control can
be obtained by encrypting records with Ciphertext-Policy
Attribute-Based Encryption (CP-ABE) [2, 24]. CP-ABE is capable
of encrypting a record in such a way that only those who
fulfill a particular access policy can decrypt it afterwards.
Such an access policy is expressed through a Boolean formula
computed over the attributes that describe the client.
The client is able to decrypt the record only if such a formula
evaluates to true. The access policies must be such that the
peers can always decrypt any record.
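
As a rough illustration of how a VPL record could be laid out inside an optional data field, the sketch below packs a header followed by an opaque ciphertext. The byte layout is an assumption, not a format defined in this paper: it uses the 8-byte consortium identifier and 16-byte record identifier assumed in Section 5, omits the key identifier (whose size is not specified), and treats the encrypted payload as bytes produced elsewhere by IBE or CP-ABE. The names `pack_record` and `unpack_record` are hypothetical.

```python
import struct

# Assumed header layout: 8-byte consortium id followed by 16-byte record id.
HEADER = struct.Struct(">8s16s")


def pack_record(consortium_id: bytes, record_id: bytes, ciphertext: bytes) -> bytes:
    """Serialize a VPL record for the optional data field of a transaction."""
    if len(consortium_id) != 8 or len(record_id) != 16:
        raise ValueError("identifiers must be exactly 8 and 16 bytes long")
    return HEADER.pack(consortium_id, record_id) + ciphertext


def unpack_record(blob: bytes):
    """Split optional data back into (consortium id, record id, ciphertext)."""
    consortium_id, record_id = HEADER.unpack_from(blob)
    return consortium_id, record_id, blob[HEADER.size:]


blob = pack_record(b"CONSORT1", b"0123456789abcdef", b"<ciphertext bytes>")
print(len(blob), unpack_record(blob)[0])
```

A fixed-size header lets a VPL member scan public transactions and discard those whose consortium identifier does not match before attempting any decryption.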


² In some cryptocurrencies the optional data field is called memo field.


Figure 4: VPL validation record format


4.3 VPL Validation Records 


Fig. 4 shows the format of a VPL validation record. The VPL 
validation record is stored inside the optional data field of a 
cryptocurrency transaction. Here too, we require that the
transaction header contains a signature of the whole transaction
including the optional data, in such a way that the VPL
validation record is signed by the peer that produced it. The
VPL header contains an identifier of the VPL consortium to
which the record belongs. It also contains the identifier of
the VPL record that it validates. The validation proof contains
the proof that the VPL peers reached consensus in declaring
the VPL record valid. The validation proof must contain
all the information the VPL clients need to
determine whether the consensus protocol ended correctly.
As mentioned above, the consensus protocol can
be any kind of consensus protocol, for example a Practical
Byzantine Fault Tolerance (PBFT) protocol, which reaches
consensus in a short time and resists up to 33% malicious
VPL peers in the consortium. If a PBFT protocol
is employed, the validation proof must contain the positive
outcome of the validation signed by all the VPL peers [10].


4.4 Threat Model 


The main adversary against a VPL is a client wanting to store 
semantically inconsistent data in the VPL, possibly colluding 
with one or more peers. For example, think about a car insurance
application where customers pay a premium based
on how many miles the car traveled. Suppose that customers
store their GPS traces on a VPL that also gathers other independent
measurements, for example those coming from
electronic toll collection systems on highways. The insurance
company computes the premium based on such GPS
traces. A malicious VPL client could try to store GPS traces
inconsistent with the records of the electronic toll collection
system. In this way, the client can pay an insurance premium lower
than the real one, computed on fake GPS traces.

In the simplest attack scenario, a malicious VPL client
generates a VPL record carrying semantically inconsistent
data. However, the VPL peers will reject the semantic validity
of this inconsistent VPL record, so they do not generate the
VPL validation record. In a more complex attack scenario, the
client initially generates a VPL record carrying semantically
consistent data, so that the peers will generate a VPL validation
record for the VPL record. After the VPL validation
record is published on the blockchain, the malicious VPL
client tries to modify the encrypted payload of the validated
VPL record. However, the malicious client cannot do that
since the underlying cryptocurrency blockchain provides
immutability. The VPL client may also collude with one
or more VPL peers. In this scenario, the client generates a
VPL record carrying semantically inconsistent data, and the
colluding peers try to reach consensus on the validity of
this inconsistent VPL record among all the peers, so that
a VPL validation record is generated and published on the
blockchain. However, the colluding peers cannot do that unless
they are more than 33% of the total VPL peers. This is
because the PBFT protocol is employed to reach consensus
on the semantic validity of a VPL record.
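
The collusion bound can be made concrete with a small sketch. It relies on the standard PBFT requirement that a system of n peers tolerates at most f Byzantine peers with n ≥ 3f + 1, which is the precise form of the 33% threshold quoted above; the function names are hypothetical.

```python
def max_tolerated_faults(n_peers: int) -> int:
    """Largest f such that n_peers >= 3 * f + 1 (standard PBFT bound)."""
    return (n_peers - 1) // 3


def safety_holds(n_peers: int, n_colluding: int) -> bool:
    """True while the colluding peers stay within the tolerated bound."""
    return n_colluding <= max_tolerated_faults(n_peers)


for n in (4, 7, 22, 28):
    print(n, "peers tolerate up to", max_tolerated_faults(n), "colluding peers")
```

For instance, a consortium of 4 peers tolerates a single colluding peer, so a malicious client would need to corrupt at least 2 of the 4 peers before the guarantee no longer applies.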


5 FEASIBILITY ANALYSIS 


In this section we study the feasibility of implementing a Vir- 
tual Private Ledger over a public cryptocurrency blockchain. 
To this aim, we considered the Boneh-Franklin (BF) encryption
scheme [6] and the Bethencourt-Sahai-Waters (BSW)
encryption scheme [2], which are classic schemes for IBE
and CP-ABE, respectively. Recall that IBE is capable
of encrypting a VPL record in such a way that only those who have
a particular identity can decrypt it afterwards. In our case,
the VPL records must be encrypted with the identity of the
VPL consortium, in such a way that all the VPL members
can decrypt them. On the other hand, CP-ABE is capable
of encrypting a VPL record in such a way that only those who
fulfill a particular access policy can decrypt it afterwards,
thus granting fine-grained access control over which VPL
members can decrypt a VPL record.

We considered the minimal set of fields that must be in- 
cluded in each VPL record and each VPL validation record. 
A VPL record must always include a VPL record identifier, 
which is needed to relate a VPL validation record with a VPL 
record, and a VPL consortium identifier, which is an identifier 
for distinguishing different Virtual Private Ledgers on the 
same cryptocurrency blockchain. We assume VPL record
identifiers are 16 bytes long and VPL consortium identifiers
8 bytes long. We also took into consideration the encryption overhead,
that is, the size difference between a plaintext and the
corresponding ciphertext, which depends on the employed
encryption scheme. Assuming a security level of 80 bits, the
BF encryption scheme adds an encryption overhead of 84
bytes [6], whereas the BSW encryption scheme adds an encryption
overhead which depends on the access policy with
which the client encrypts the VPL record. Supposing a security
level of 80 bits and access policies of 5 attributes, the
BSW encryption scheme adds an encryption overhead of 832
bytes. If a CP-ABE scheme is employed, we also need to embed in
the VPL record a representation of the access policy that
a VPL member needs to fulfill for decrypting the VPL record.
An access policy is a Boolean formula specified by the client
that encrypts the VPL record. We assume access policies of 5
attributes are represented with strings of 30 bytes. To sum up,
the minimum VPL record size (not counting the data payload,
which depends on the specific application) is 108 bytes if the
VPL does not employ access control mechanisms, and 886
bytes if the VPL employs an access control mechanism with
access policies of 5 attributes.


Table 1: VPL record and VPL validation record field sizes


Field                               | Size
VPL record identifier               | 16 bytes
VPL consortium identifier           | 8 bytes
BF encryption scheme overhead³      | 84 bytes
BSW encryption scheme overhead³,⁴   | 832 bytes
5-attribute policy representation   | 30 bytes
VPL validation record identifier    | 16 bytes
Semantic validation outcome         | 4 bytes
VPL peer identifier                 | 4 bytes
ECDSA signature³                    | 40 bytes
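
The two minimum record sizes follow from summing the assumed field sizes, as the short sketch below shows. The constant and function names are illustrative; the byte values are the ones assumed in this section.

```python
# Field sizes in bytes, as assumed in this section.
RECORD_ID = 16
CONSORTIUM_ID = 8
BF_OVERHEAD = 84             # IBE (Boneh-Franklin), 80-bit security
BSW_OVERHEAD_5_ATTRS = 832   # CP-ABE (BSW), 80-bit security, 5 attributes
POLICY_5_ATTRS = 30          # textual representation of a 5-attribute policy


def min_record_size(access_control: bool) -> int:
    """Minimum VPL record size in bytes, excluding the data payload."""
    size = RECORD_ID + CONSORTIUM_ID
    if access_control:
        return size + BSW_OVERHEAD_5_ATTRS + POLICY_5_ATTRS
    return size + BF_OVERHEAD


print(min_record_size(access_control=False))  # 16 + 8 + 84
print(min_record_size(access_control=True))   # 16 + 8 + 832 + 30
```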

A VPL validation record must always include a VPL validation
record identifier and the identifier of the VPL record that
it is validating. We assume VPL validation record identifiers
are 16 bytes long. The VPL validation record must also carry
the semantic validation outcome, i.e., whether the data carried
in the VPL record were semantically valid or not, and the
proof that the VPL peers reached consensus in declaring
the VPL record as valid or not. Assuming a VPL consortium
employs a PBFT protocol among N peers, this proof must include
N signatures of the semantic validation outcome [10],
along with the corresponding peer identities, represented
by VPL peer identifiers. We assume the semantic validation
outcomes are 4 bytes long, and the peer identifiers 4 bytes long.
We assume the use of the Elliptic Curve Digital Signature Algorithm
(ECDSA) with a security level of 80 bits as the signature
algorithm, which results in signatures of 40 bytes [17]. To
sum up, the VPL validation record size is 36 + 44 × N bytes:
36 bytes of fixed fields (16 + 16 + 4) plus 44 bytes (4 + 40)
for each peer.
Table 1 summarizes the sizes of all the fields of the VPL
records and of the VPL validation records.
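
The size formula for a VPL validation record can be written as a one-line function. This is only a sketch of the arithmetic stated above; the names are illustrative.

```python
# Field sizes in bytes, as assumed in this section.
VALIDATION_RECORD_ID = 16
VALIDATED_RECORD_ID = 16
OUTCOME = 4
PEER_ID = 4
ECDSA_SIGNATURE = 40   # 80-bit security level

FIXED_PART = VALIDATION_RECORD_ID + VALIDATED_RECORD_ID + OUTCOME  # 36 bytes
PER_PEER_PART = PEER_ID + ECDSA_SIGNATURE                           # 44 bytes


def validation_record_size(n_peers: int) -> int:
    """Size in bytes of a VPL validation record signed by n_peers peers."""
    return FIXED_PART + PER_PEER_PART * n_peers


for n in (1, 4, 5):
    print(n, validation_record_size(n))
```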


³ Assuming a security level of 80 bits.
⁴ Assuming a fixed size of the attribute set equal to 5.


Table 2: Available space in optional data fields with different cryptocurrencies


Cryptocurrency | Optional data max size
Bitcoin        | 83 bytes⁵
EOS.IO         | 256 bytes⁶
Ethereum       | 98,225 bytes⁷
IOTA           | 1300 bytes⁸
Stellar        | 28 bytes⁹
XRP            | 1024 bytes¹⁰


Table 2 summarizes the maximum available space that can
be used for embedding a VPL record and a VPL validation
record in the optional data field of a transaction for six different
cryptocurrencies, namely, Bitcoin, EOS.IO, Ethereum,
IOTA, Stellar and XRP. We assume that each VPL record
and each VPL validation record is embedded in a single cryptocurrency
transaction, as fragmenting a single record across multiple
cryptocurrency transactions would result in high costs and long
delays in publishing them in a block. Given the
VPL record and VPL validation record sizes described
above, it is impossible to embed them into a single Bitcoin or
Stellar transaction, which makes these cryptocurrencies unable
to host a Virtual Private Ledger. The EOS.IO blockchain
can be chosen as the underlying blockchain only if the VPL does
not employ an access control mechanism for VPL records.
The Ethereum, IOTA and XRP blockchains, instead, offer enough
available space in a single transaction to host a Virtual Private
Ledger.
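
The feasibility argument above amounts to comparing each optional data size in Table 2 with the two minimum VPL record sizes (108 and 886 bytes). The sketch below reproduces that comparison; the labels returned by `classify` are illustrative, and a capacity that leaves no room for any payload is treated as unfeasible.

```python
# Optional data capacity in bytes, from Table 2.
OPTIONAL_DATA_MAX = {
    "Bitcoin": 83,
    "EOS.IO": 256,
    "Ethereum": 98225,
    "IOTA": 1300,
    "Stellar": 28,
    "XRP": 1024,
}

MIN_RECORD_NO_AC = 108   # IBE, no access control
MIN_RECORD_AC = 886      # CP-ABE, 5-attribute policies


def classify(capacity: int) -> str:
    """Classify a cryptocurrency by the VPL variants it can host."""
    if capacity <= MIN_RECORD_NO_AC:
        return "unfeasible"
    if capacity <= MIN_RECORD_AC:
        return "without access control only"
    return "with or without access control"


for name, capacity in OPTIONAL_DATA_MAX.items():
    print(f"{name}: {classify(capacity)}")
```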

Finally, in Table 3 we compare the maximum payload size
for every feasible blockchain, the minimum cost
for generating a cryptocurrency transaction¹¹, the maximum
number of VPL peers in a VPL consortium and the average
transaction confirmation time. The transaction confirmation
time is defined as the time elapsed between the moment a
cryptocurrency transaction is submitted to the blockchain
and the moment it is recorded in a block on which consensus
has been reached. In other words, it represents the total time a VPL client
has to wait until a transaction gets collected and included in a
block in the blockchain. As we already said, the EOS.IO cryptocurrency
is feasible only if a VPL consortium does not need
access control for VPL records. As its maximum optional data
size is quite low, the maximum number of VPL peers cannot
exceed 5 peers. However, the average transaction confirmation
time is very low, as in EOS.IO a block is generated
exactly every 0.5 seconds [4], and it is not required to pay
for generating transactions. The IOTA cryptocurrency offers
instead a quite large optional data field, leaving more than
400 bytes for the payload in case a VPL consortium needs
access control for VPL records. Moreover, we can have up to
28 VPL peers in a consortium and it is possible to generate
transactions without transferring money. However, the IOTA
transaction confirmation time is quite variable as it depends
on the global transaction generation rate. In the worst case,
a transaction may be confirmed even after 10 minutes [12].
Furthermore, with IOTA the VPL client must solve a PoW
puzzle for every VPL record it generates, which is practically
unfeasible if the VPL client is implemented on a battery-powered
device [16]. In case a VPL consortium needs access control
policies for VPL records, the XRP cryptocurrency is suitable
only if a VPL record carries a small amount of data, i.e., at most
138 bytes. In XRP a VPL member has to pay a very small
amount of money for generating a transaction. Moreover,
the XRP cryptocurrency requires that a node must own at
least 20 XRP¹² for generating a transaction, otherwise the
transaction would never be included in a block [22]. For this
reason, a VPL member has to pay a very small amount of
dollars for generating VPL records or VPL validation records.
It is important to notice that XRP offers a good transaction
confirmation time, which is generally below 4 seconds.


⁵ http://bit.ly/bitcoin_transaction_size
⁶ http://bit.ly/EOS_memo_size
⁷ http://bit.ly/ethereum_transaction_size
⁸ http://bit.ly/iota_transaction_size
⁹ http://bit.ly/stellar_memo_size
¹⁰ http://bit.ly/xrp_memo_size
¹¹ Exchange values computed on April 11, 2019.


Table 3: Comparison of different feasible cryptocurrency blockchains


Cryptocurrency                                  | Max payload (no access control) | Max payload (with access control) | Minimum transaction cost | Transaction confirmation time | Max peers N
EOS.IO                                          | 148 bytes                       | (unfeasible)                      | 0$                       | 0.5 s                         | 5
IOTA                                            | 1192 bytes                      | 414 bytes                         | 0$                       | [2, 10] min                   | 28
XRP                                             | 916 bytes                       | 138 bytes                         | 0.00000337$              | 3.6 s                         | 22
Ethereum (max optional data size = 1000 bytes)  | 892 bytes                       | 114 bytes                         | 0.059$                   | 30 s                          | 21
Ethereum (max optional data size = 5000 bytes)  | 4892 bytes                      | 4114 bytes                        | 0.24$                    | 30 s                          | 112
Ethereum (max optional data size = 10000 bytes) | 9892 bytes                      | 9114 bytes                        | 0.46$                    | 30 s                          | 226
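
The payload and peer columns of Table 3 can be re-derived from the optional data sizes and the record sizes computed earlier in this section, as the sketch below shows. It covers only the size-related columns; costs and confirmation times are measured values and are not modeled. The function and dictionary names are illustrative.

```python
from typing import Optional

MIN_RECORD_NO_AC = 108     # bytes, IBE without access control
MIN_RECORD_AC = 886        # bytes, CP-ABE with 5-attribute policies
VALIDATION_FIXED = 36      # bytes, fixed part of a validation record
VALIDATION_PER_PEER = 44   # bytes, peer identifier plus ECDSA signature

# Optional data capacity in bytes for each scenario of Table 3.
CAPACITY = {
    "EOS.IO": 256,
    "IOTA": 1300,
    "XRP": 1024,
    "Ethereum (1000)": 1000,
    "Ethereum (5000)": 5000,
    "Ethereum (10000)": 10000,
}


def max_payload(capacity: int, min_record: int) -> Optional[int]:
    """Bytes left for application data, or None if the record does not fit."""
    room = capacity - min_record
    return room if room > 0 else None


def max_peers(capacity: int) -> int:
    """Largest N such that 36 + 44 * N fits in the optional data field."""
    return max((capacity - VALIDATION_FIXED) // VALIDATION_PER_PEER, 0)


for name, cap in CAPACITY.items():
    print(name,
          max_payload(cap, MIN_RECORD_NO_AC),
          max_payload(cap, MIN_RECORD_AC),
          max_peers(cap))
```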

The Ethereum cryptocurrency needs to be analyzed in
multiple scenarios. In particular, the cost of an Ethereum
transaction depends on its size [28]. For this reason, we analyzed
three different Ethereum optional data sizes, i.e., 1000
bytes, 5000 bytes and 10000 bytes. For every size, we computed
the maximum number of VPL peers and the cost of
a single transaction. The Ethereum cryptocurrency offers a
very large optional data field and a moderate transaction confirmation
time. However, the cost of a transaction becomes
prohibitive for large VPL records, up to 0.46$ for a single VPL
record. If a VPL client generates multiple VPL records in a
day, the cumulative cost for a VPL client becomes very high.
Moreover, the VPL peers may not be encouraged to generate
VPL validation records, if they have to pay for publishing
them. In this situation, a VPL client may be forced by the
VPL consortium to pay an extra fee to the VPL peers to cover
the cost of a VPL validation record, thus making Ethereum
quite expensive.


¹² At the time of writing, 20 XRP equals 6.75$.

In conclusion, we consider the IOTA cryptocurrency the
best choice for hosting a Virtual Private Ledger, if the VPL
consortium does not have timing constraints and if clients
do not have hardware constraints for performing PoW puzzles.
With IOTA, we can also have a moderate number of
VPL peers in the consortium and, most importantly, there is
no need to pay for generating IOTA transactions. If we
need a lower transaction confirmation time instead, the XRP
cryptocurrency may be the best choice, especially if the VPL
consortium does not need access control policies for records.
Finally, we consider the Ethereum cryptocurrency the best
choice only if members need to generate very large VPL
records and a large number of VPL peers is needed, because
its cost becomes very high.
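
The selection guideline above can be summarized as a small decision rule. The following Python sketch is only illustrative: the function name `choose_ledger`, its boolean parameters and the order in which the conditions are tested are assumptions of this example, as is the fallback to XRP for the cases that the discussion leaves open.

```python
def choose_ledger(timing_constraints: bool,
                  pow_hardware_constraints: bool,
                  very_large_records: bool = False,
                  many_peers: bool = False) -> str:
    """Hypothetical helper that encodes the guideline discussed in the text."""
    # Ethereum is suggested only when very large records AND many peers are needed.
    if very_large_records and many_peers:
        return "Ethereum"
    # IOTA: no transaction fee, but it requires relaxed timing and
    # clients that are able to perform the PoW puzzles.
    if not timing_constraints and not pow_hardware_constraints:
        return "IOTA"
    # Assumption: every remaining case falls back to XRP, which offers
    # the lower confirmation time.
    return "XRP"
```

For instance, `choose_ledger(timing_constraints=True, pow_hardware_constraints=False)` selects XRP under these assumptions.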


6 CONCLUSIONS 


The inspiration for this paper came from Virtual Private
Network (VPN) technology, which allows us to build a private
network over a public insecure one by means of cryptography.
Analogously, in this paper we explored the possibility
of implementing a Virtual Private Ledger (VPL) by embedding
it in an existing cryptocurrency blockchain by means
of cryptography. A blockchain is a recent and special kind
of distributed ledger, which has gained great momentum
in recent years due to its application in the widespread
Bitcoin cryptocurrency. A VPL provides the same confidentiality
and integrity as a private distributed ledger, but
without its high operational costs. We presented a general
architecture for a VPL and gave a guideline for implementing
it on top of a generic cryptocurrency blockchain. We also
explored the technological possibilities of applying
Identity-Based Encryption (IBE) and Ciphertext-Policy
Attribute-Based Encryption (CP-ABE). Finally, we discussed
the possibility of implementing a VPL on top of different
existing cryptocurrency blockchains, such as Bitcoin, EOS.IO,
Ethereum, IOTA, Stellar and XRP. We found that the
Bitcoin and Stellar cryptocurrencies are not suitable, as they
do not offer enough space for embedding encrypted data in
a single transaction. The IOTA, EOS.IO and XRP cryptocurrencies
are instead the best choices for implementing a VPL,
as they offer enough free space for encrypted data and
their users do not need to pay for generating transactions.
ABSTRACT


Artificial intelligence systems are increasingly used to support
the early diagnosis of many relevant diseases. The spread of these
systems is boosted by the application of machine learning
techniques to datasets (also in the form of videos and images)
obtained from different information sources. A key role is played
by artificial vision systems that are in charge of reasoning on data
acquired from different devices, including smartphones. The
ease of disseminating and sharing information has led to the
globalization of medical protocols previously used only in some
areas of the world. This is the case of tongue inspection, widely used
in Traditional Chinese Medicine (TCM) to perform a diagnosis,
which allows physicians to obtain useful indications on the state
of internal organs by observing the color and the consistency of
the patient's tongue. The current interest in tongue image analysis is
also motivated by the possibility of performing a first self-analysis
for a possible disease, which may suggest further medical investigation.
This paper is a non-exhaustive overview of the features most
frequently used in artificial vision systems applied to
tongue analysis. It highlights shortcomings in some of the existing
studies and provides insights for future research. Our work aims
to provide a unifying view that can support researchers
working on Tongue Colored Image Analysis.
1. Introduction


Until recently, family doctors had a great ability to observe and
evaluate every detail of the body, and to grasp changes in normal
physiology and signs of disease in progress. Today, however,
much weight is given to the results of laboratory analysis and
little importance to what emerges from the visit: the risk is that
therapy is guided by the values of laboratory tests and not by
the results of the patient's medical examination. As for the tongue, in
the recent past its analysis was systematically performed by
the doctor at each visit. Today this is done much less often or not at all.
This is unfortunate, because tongue analysis can provide much
useful information and suggest the first treatment. In
Traditional Chinese Medicine, tongue analysis is taken into great
account and is used as an extraordinary diagnostic method, as it
has been shown that correlations exist between the appearance of
the tongue and the patient's health status. According to different
branches of Oriental Medicine [26], different areas of the tongue
reflect the functioning of internal organs (Fig. 1).
Figure 1: Tongue Reflexology Chart

The tongue is divided transversally into three parts, each
corresponding to some organs: kidneys, adrenal glands and intestines in
the proximal third; spleen, liver, pancreas and stomach in the middle
third; and lungs and heart in the distal third.

The median longitudinal line that divides the tongue into two
halves represents the spinal column. The central part, which spans the three
sectors and covers the three main digestive organs (stomach, small
intestine and colon), is related to the phases of digestion.

In-depth studies have been carried out in China since ancient
times: the first treatises on this topic go back to the Shang
Dynasty (1600-1000 BC). Starting from the 1980s, again in
China, systematic studies on the correlation between the
appearance of the tongue and some kinds of cancer have been
conducted and published in medical journals. In particular, the TCM
Association, the Chinese Oncology Association and the TCM
Diagnosis Association conducted a national project involving
12,448 cancer patients, 1,628 patients suffering from other
diseases and 5,578 healthy people. The results showed that the
tongue of patients affected by cancer exhibits signs (such as
color, shape, mucus layer, etc.) that are statistically significant in
pathognomonic terms compared to that of healthy people.
Therefore, it is quite evident that physicians can obtain many
insights from a correct analysis of the tongue. Often the tongue,
like the wrist, can reveal in advance the conditions that could
eventually emerge. Obviously, to achieve a correct diagnosis,
tongue signs have to be integrated with additional information on
the patient's health.


2. How is the tongue made? 


The tongue is an organ of the human body that occupies most of
the oral cavity; it is composed of various anatomical structures:
mucous membranes, lingual papillae (also known as taste buds)
and various muscles. It constitutes the anterior wall of the
oropharynx. Its dorsal surface, formed by the lingual mucosa, is
convex in every direction and can be divided into two parts,
different both in appearance and in embryological origin,
called body and root of the tongue, or oral portion and pharyngeal
portion. They are separated by an inverted V-shaped groove called
the terminal groove, the apex of which forms a small cavity
called the blind hole. The tongue is connected posteriorly to a small bone
called the hyoid and anteriorly to a small, thin filament called the
frenulum. The tongue is endowed with taste buds and
is, in fact, the main organ of taste. It kneads
food with saliva and pushes it under the teeth to be
crushed, and then pushes it down the esophagus. The body of the
tongue constitutes 2/3 of its volume and is longitudinally divided by
the median groove, which runs from the apex of the
tongue and ends just anterior to the terminal groove, near the
blind hole. With the mouth closed, the lower surface of the tongue
body is in contact with the floor of the mouth, the apex with the
upper incisors, the lateral margins with the gingival arches and the
upper surface with the hard palate and the soft palate. The dorsal
surface is covered by a transparent whitish patina consisting of the
precipitate, on the palate, of the exhalations coming from the stomach
through the esophagus. The color, thickness, consistency and ease
of removal of this patina give indications on the state
of the digestive function. On the upper surface of the tongue,
anterior to the palatoglossal arch and posterior to the terminal
groove, there is an area with 4-6 mucous folds that
are the remnants of the foliate papillae, present and functional in
many animals but not in humans. The taste buds are divided
into:


*filiform (threadlike), which appear as diffuse, tiny
dots and are spread over the entire dorsal surface of the body
of the tongue, in particular at the apex,


*fungiform, small, raised and rounded, less numerous than the
filiform and also distributed over the entire surface,


*circumvallate, more prominent and rounded than the others,
arranged only along the terminal groove.


The mucosa of the lower surface is red and has a slimy
consistency. Two mucosal folds, called fimbriated folds,
originate posteriorly and laterally at the base of the tongue and
run antero-medially, delimiting a triangular area. Medial to
these, superficially and following their course, run the two deep
lingual veins. The lingual frenulum connects
the lower surface of the tongue with the floor of the mouth.
Laterally, at its base, lie the two sublingual papillae,
where the ducts of the submandibular glands open through an
orifice. By contrast, the orifices of the sublingual glands are
numerous and located posterolaterally with respect to those of the
submandibular glands.

The root of the tongue is the posterior part of the tongue,
i.e. the part between the palatoglossal arch anteriorly
and the palatopharyngeal arch posteriorly. Its surface has
vaguely rounded reliefs formed by the protrusion of lymphatic
nodules embedded in the lamina of the lingual mucosa; the set of
lymphatic nodules constitutes the lingual tonsil.

At the apex of each nodule, the ducts of tubulo-acinar glands
open. Posterior and lateral to the lingual tonsil are the two
palatine tonsils, about 1 cm long, housed in the spaces between the
palatoglossal arch and the palatopharyngeal arch, called palatine
pits. Posterior and inferior to the lingual tonsil there is a fold of
elastic cartilage, the epiglottis, which has two lateral glossoepiglottic
folds and a median one.
Figure 2: Anatomy of the tongue


3. A road map on tongue image classification


Tongue diagnosis is one of the most successful lines of
research in complementary medicine, together with
palpation of the wrist and abdomen [1]. The color of the
tongue, its structure and its geometrical conformation are
correlated with some diseases. The objective
is to define protocols for clinical practice and to improve the
contribution of computerized systems for the analysis of tongue
images. In traditional oriental medicine, tongue color is a
discriminating element for diagnosing diseases due to physical
and mental disorders such as blood congestion, water imbalance
and psychological problems [2].

A group of researchers used a tongue color gamut descriptor,
with an SVM for the classification phase [3, 4].
These studies confirmed that the color range of the tongue is
very narrow and varies in shades of red. To the naked eye, the
colors of different regions of the tongue may appear almost
identical, thus preventing a correct diagnosis. Current research
on computerized tongue image analysis systems uses machine
learning techniques to achieve a higher accuracy rate and a
shorter execution time. Choosing the most effective
features for the classification phase therefore becomes
necessary, since the adoption of too many features implies a complex
descriptive mapping in the classifier [5, 6].

TCM has experience spanning millennia and boasts significant healing
benefits with few side effects. Modern medicine, on the other
hand, is focused on the cause-effect relationship: sometimes the
side effects are important both in the medium and in the long
term. Compared to modern medicine, some TCM practices are
potentially applicable in health and rehabilitative care protocols.
There are four fundamental diagnostic methods in TCM:
inspection, smell, interrogation and palpation [7].

Tongue diagnosis is one of the most topical subjects in
the field of medicine, both for the diagnostic potential of the analysis
of digital images and for the simplicity with which such images
can be obtained. Images of the tongue are increasingly used in
clinical work, and the retrieval of images and their computer
management has become a difficult topic. Traditional
management of tongue images involves manual labeling of the
information regarding the images and a search phase based on
the previously obtained information. However, this approach cannot
meet the needs related to the efficient retrieval of images in
large-scale datasets. Moreover, traditional tongue diagnosis
depends on the doctor's experience, and it is likely that different
doctors will produce different diagnoses for the same patient.
Recently, automatic image processing technologies have been
applied to support tongue diagnosis in traditional Chinese
medicine. These methods allow the inspection of tongue images
through the definition and use of geometry, texture and color
features [8, 9]. More specifically, in [8, 10] tongue images
are managed in different color spaces with different metrics, and a
method for color recognition on tongue images based on region
partition is proposed.

Li and Yuen [8] addressed the problem of matching
color images in medical diagnosis by presenting an ordered metric
in the coordinate space. Wang et al. [10] proposed a new
tongue color calibration scheme and used a model based on a
gradient vector flow (GVF) snake that integrates color information
to extract the body of the tongue. The papers [11, 12] describe
applications that use color, texture or shape features
to classify tongue images. In particular, Chiu [11] built a
computerized tongue examination system (CTES) based on a
chromatic and structural algorithm that identifies the colors of the
tongue and the thickness of its coating. Guo [12] proposed a new
operator for the color structure, the local binary pattern of the
primary difference signal. The corresponding performance is
evaluated on color, grayscale and color structure, and on the fusion of
color and texture features. Li and Liu developed a hyperspectral
brush tongue imager and discussed the method of calibration of
the spectral response [13]. This new approach to color analysis
outperforms the traditional method, allowing significant areas of
tongue substances and coatings to be reached. Each of these
methods has its fair share of success, but also limitations
regarding precision and robustness requirements. It is necessary to
investigate the features to be used to obtain a better analysis of
tongue images.


4. Features for Tongue Diagnosis System 


Tongue inspection allows an immediate diagnosis of some
pathologies and, for this reason, it is widespread in clinical
medicine. However, the potential of this examination is limited in
traditional diagnosis. The first reason is that the tongue is
observed by the human eye instead of being analyzed by a
quantitative digital instrument. Secondly, the evaluation process is
subjective, being linked to the experience and knowledge of the
doctor making the diagnosis. Subjectivity is overcome
by introducing computerized systems that make tongue
analysis objective and repeatable by performing tongue
image analysis.

Computerized tongue diagnosis systems support the storage and
transmission of digital data, as well as image analysis. The typical
schema of such a system is reported in Fig. 3 and consists of four
phases: image acquisition, preprocessing, feature
analysis and classification.

The acquisition of digital images is the first phase in a
computer vision system and includes many techniques used to
acquire the image to be analyzed. Each of these techniques
has advantages and disadvantages that make it more or less
suitable depending on the case.

The most used methods for automated non-invasive diagnosis
include photography, confocal scanning laser microscopy
(CSLM), ultrasound, magnetic resonance imaging (MRI), optical
coherence tomography (OCT), multispectral imaging, computed
tomography (CT), positron emission tomography (PET),
multi-frequency electrical impedance and Raman spectra.

Preprocessing is the stage used to improve the quality
of images, through color correction and removal of irrelevant
noise that may cause inaccuracies in classification [42].

A first goal is to separate the tongue from the background, for
example by edge detection techniques.

The quickest way to remove defects related to image acquisition is
to use filters such as mean filters, median filters or Gaussian
filters [18]. These filters can be applied directly to grayscale
images, and are applied to each channel separately on color images
(marginal filtering).
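
As an illustration of marginal filtering, the following Python sketch applies a median filter to each channel of a color image independently. It is a minimal sketch under stated assumptions: the function names, the representation of an image as a list of rows of (R, G, B) tuples and the handling of the borders (only the neighbours inside the image are used) are choices of this example and are not prescribed by [18].

```python
from statistics import median


def median_filter_channel(channel, radius=1):
    """Median filter on one channel (list of rows); radius=1 gives a 3x3 window.

    Border pixels use only the neighbours that fall inside the image.
    """
    h, w = len(channel), len(channel[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [channel[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = median(window)
    return out


def marginal_median_filter(image, radius=1):
    """Filter the R, G and B planes separately, then recombine them."""
    h, w = len(image), len(image[0])
    planes = []
    for c in range(3):
        plane = [[pixel[c] for pixel in row] for row in image]
        planes.append(median_filter_channel(plane, radius))
    return [[tuple(planes[c][y][x] for c in range(3)) for x in range(w)]
            for y in range(h)]
```

A single outlier pixel surrounded by uniform neighbours is replaced by the neighbourhood value, which is the behaviour expected when removing impulsive acquisition noise.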


[Figure 3 diagram: Acquisition (Tongue Image Acquisition) -> Preprocessing (Color Correction, Tongue Image Segmentation) -> Feature Analysis (Geometry Features, Texture Features, Color Features) -> Classification (Diagnostic Classifier)]


Figure 3: Typical Computerized tongue diagnosis system


In this paper we investigate the selection of relevant
features useful for implementing medical applications for
tongue analysis. These features, detailed in the next
sections, are reported in Fig. 3 and are of three main types,
namely geometric, texture and color. Much work has been done to
accurately and effectively extract those features, whose adoption is
fundamental for the sound usability of these systems. In the following
sections, we report, by type, the most relevant
features used for the implementation of medical applications in
tongue analysis.


4.1 Geometry Features 


Various proposals for computerized tongue diagnosis systems
focus on the use of color and texture features of the images [14,
15]. Despite the fact that oriental medicine uses the shape of the
tongue as an element to discriminate the presence of
possible diseases [16, 17, 18], there is little literature on the
geometric features suited to a computerized tongue diagnosis
system.

In this section, we focus on the main geometric features
useful for implementing models that determine whether a given
tongue belongs to a particular shape type. The most commonly
referenced tongue types are Triangular, Round, Ellipse, Square and
Rectangular (Fig. 4).


Figure 4: Examples of tongue shape (a - Triangular, b - Round, c - Ellipse, d - Square, e - Rectangular)


By using the geometric features appropriately, it is possible to discern
the shape of the tongue. Experimental results have shown that
better classification accuracy can be obtained on tongue images
previously separated by shape [19, 20]. In more detail, images
are first preprocessed to obtain the binary mask that separates the
body of the tongue from the outline [15]; subsequently, geometric
patterns are used to derive the shape of the tongue. The separation
of the region of interest is a common step that is also adopted in other
medical fields [43].

Below, we report the geometric features useful for determining the
type of tongue shape.


Width. The width (w) feature (Fig. 5) is the horizontal distance
along the x-axis from a tongue's furthest right edge point ($x_{max}$) to
its furthest left edge point ($x_{min}$):


$w = x_{max} - x_{min}$ (1)


Length. The length (l) feature (Fig. 5) is the vertical distance
along the y-axis from a tongue's furthest bottom edge point ($y_{max}$)
to its furthest top edge point ($y_{min}$):


$l = y_{max} - y_{min}$ (2)


Length-Width Ratio. The length-width ratio (lw) is defined as the
ratio of a tongue's length to its width:


$lw = \frac{l}{w}$ (3)


Smaller Half Distance. The smaller half distance (z) is defined as
half of the shorter of l and w (Fig. 5):


$z = \frac{\min(l, w)}{2}$ (4)


Figure 5: Illustration of features (1), (2), and (4)
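
Features (1)-(4) can be computed directly from the binary mask of the tongue. The following Python sketch is a minimal illustration: the function name `basic_geometry` and the representation of the mask as a list of rows with truthy values on tongue pixels are assumptions of this example, and a mask whose width is zero is not handled.

```python
def basic_geometry(mask):
    """Compute w, l, lw and z (features 1-4) from a binary tongue mask.

    mask is a list of rows; a truthy value marks a tongue pixel.
    Distances are differences of pixel coordinates, as in (1) and (2).
    """
    xs = [x for row in mask for x, value in enumerate(row) if value]
    ys = [y for y, row in enumerate(mask) for value in row if value]
    w = max(xs) - min(xs)        # feature (1): x_max - x_min
    l = max(ys) - min(ys)        # feature (2): y_max - y_min
    lw = l / w                   # feature (3): length-width ratio
    z = min(l, w) / 2            # feature (4): smaller half distance
    return {"w": w, "l": l, "lw": lw, "z": z}
```

On a mask whose tongue pixels span columns 1 to 4 and rows 1 to 3, the sketch returns w = 3, l = 2, lw = 2/3 and z = 1.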


Center Distance. The center distance $c_d$ (Fig. 6(a)) is the distance
from w's y-axis center point to the center point of l ($y_{cp}$):


$c_d = \frac{\max(y_{x_{max}}) + \max(y_{x_{min}})}{2} - y_{cp}$ (5)


Center Distance Ratio. The center distance ratio $c_{dr}$ is the ratio of $c_d$ to l:


$c_{dr} = \frac{c_d}{l}$ (6)


Area. The area (a) is the surface, measured in pixels, of the
tongue in the considered image.

Circle, Square and Triangle Areas. The circle area ($c_a$) and the square
area ($s_a$) within the tongue are defined from the
smaller half distance z (4), used as the radius of the circle and as the half side of the square: see Fig. 6 (b) and Fig. 6 (c).

The triangle area ($t_a$) is the area of a triangle within the tongue (Fig.
6(d)). Considering $x_{max}$ and $x_{min}$ as the right and the left points,
and $y_{max}$ as the bottom point of the triangle, we obtain:


$c_a = \pi r^2 = \pi z^2 = \pi \left( \frac{\min(l, w)}{2} \right)^2$ (7)

$s_a = 4 z^2 = 4 \left( \frac{\min(l, w)}{2} \right)^2$ (8)

$t_a = \frac{b h}{2}$ (9)


where b and h are respectively the base and the height of the
triangle.


Circle, Square and Triangle Area Ratio. The circle, square and
triangle area ratios ($c_{ar}$, $s_{ar}$, $t_{ar}$) are the ratios of the considered
area to a:


$c_{ar} = \frac{c_a}{a}$ (10)

$s_{ar} = \frac{s_a}{a}$ (11)

$t_{ar} = \frac{t_a}{a}$ (12)


By appropriately using the features described above, various
approaches to image analysis that allow the effective identification
of the tongue shape have been proposed [21, 22].
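
The area features (7)-(12) follow from the tongue area a, the smaller half distance z and the three vertices of the inscribed triangle. The Python sketch below is illustrative only: the function names are hypothetical, and the triangle area is computed from its vertices with the shoelace formula, which is equivalent to bh/2; how the y-coordinates of the right and left points are chosen is left to the caller, since it is an assumption not fixed by the text.

```python
import math


def triangle_area(p1, p2, p3):
    """Area of a triangle from its vertices (shoelace formula, equal to b*h/2)."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2


def area_features(a, z, right_pt, left_pt, bottom_pt):
    """Features (7)-(12), given the tongue area a (in pixels) and z from (4)."""
    c_a = math.pi * z ** 2                                # feature (7)
    s_a = 4 * z ** 2                                      # feature (8)
    t_a = triangle_area(right_pt, left_pt, bottom_pt)     # feature (9)
    return {
        "c_a": c_a, "s_a": s_a, "t_a": t_a,
        "c_ar": c_a / a,                                  # feature (10)
        "s_ar": s_a / a,                                  # feature (11)
        "t_ar": t_a / a,                                  # feature (12)
    }
```

With a = 100, z = 2 and a triangle of base 10 and height 8, the sketch gives a square area of 16 and a triangle area of 40, hence ratios of 0.16 and 0.4.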


Figure 6: (a) feature (5), (b) feature (7), (c) feature (8) and (d) feature (9)


4.2 Texture Features


A texture can be defined as a two-dimensional image that
represents information about the structure of a particular surface.
Textures are variations in intensity or color, originating from the
roughness of the surfaces of objects hit by a light source. For our
purposes, the adoption of appropriate texture features within an
image coding process facilitates the classification of the analyzed
tissues as healthy or pathological. Texture analysis approaches can be
divided into two main categories: statistical and spatial-frequency based
[24].

Statistical approaches typically consider a discrete set of pixels
around a fixed one in order to evaluate the properties that relate
the single pixel to its neighborhood. Statistical approaches
evaluate various properties and are suitable when the texture primitive
sizes are comparable with the pixel sizes. These
properties include, among others, Fourier transforms, convolution
filters, co-occurrence matrices, spatial autocorrelation and fractals.
Through these properties, it is possible to identify different
textures with respect to specific pre-selected parameters, based on
the distributions of the gray levels or color channels of the pixels
composing the image.

Spatial frequency approaches evaluate the image in the frequency
domain to recognize patterns [23, 24].

The tone and the structural relations between the primitives of
an image make it possible to identify the textures within the image. The
tone depends on the intensity of the pixels in the primitives, in terms
of gray values or color channel values, whereas the structure is
related to the spatial configuration of the primitives. Therefore, a
given pixel can be characterized by its tone and position
properties. A texture primitive is a contiguous set of pixels
characterized by a certain tone, and can be described by its mean
intensity, maximum and minimum intensity, size and shape.
Among the statistical approaches, one of the most widely adopted
is based on gray-level co-occurrence matrices, which
estimate the second-order statistics
of the spatial arrangement of the gray level values.

A co-occurrence matrix [25] is a square matrix whose
elements represent the relative frequency of occurrence of pairs of
gray level values of pixels separated by a certain distance in a
given direction. Formally, the elements of a co-occurrence matrix
can be defined as:


$C_d(g_1, g_2) = \left| \{ (a, b) \in N \times N : I(a, b) = g_1, \; I(a + d_x, b + d_y) = g_2, \; (a + d_x, b + d_y) \in N \times N \} \right|$ (13)


where I(a, b) denotes a square image with a fixed number of gray
values, $d = (d_x, d_y)$ is the displacement between the two pixels,
$g_1$ and $g_2$ are two gray levels of interest and $|\cdot|$ is the
cardinality of a set. Important descriptors are obtained from the
co-occurrence matrix. One of the most relevant is $R_y$, often used to
measure the smoothness or the homogeneity of an image. It is
defined as:


$R_y = \sum_{g_1} \sum_{g_2} C^2(g_1, g_2)$ (14)


where $C(g_1, g_2)$ is a normalized co-occurrence matrix.
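
Definitions (13) and (14) can be illustrated with a short Python sketch. It is a minimal example under stated assumptions: the function names are hypothetical, the image is a list of rows of integer gray levels in the range 0 to levels - 1, the index a is read as the column and b as the row, and pairs whose second pixel falls outside the image are simply skipped.

```python
def cooccurrence(image, dx, dy, levels):
    """Count the pairs with I(a, b) = g1 and I(a + dx, b + dy) = g2, as in (13)."""
    n_rows, n_cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    for b in range(n_rows):              # b: row index (assumption)
        for a in range(n_cols):          # a: column index (assumption)
            a2, b2 = a + dx, b + dy
            if 0 <= a2 < n_cols and 0 <= b2 < n_rows:
                counts[image[b][a]][image[b2][a2]] += 1
    return counts


def normalize(counts):
    """Turn pair counts into relative frequencies that sum to 1."""
    total = sum(sum(row) for row in counts)
    return [[value / total for value in row] for row in counts]


def uniformity(normalized):
    """Sum of the squared entries of the normalized matrix, as in (14)."""
    return sum(value * value for row in normalized for value in row)
```

A perfectly uniform image concentrates all pairs in a single cell, so the descriptor reaches its maximum value of 1; the more the pairs are spread over different gray-level combinations, the lower the value becomes.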
As previously stated, in Oriental Medicine different areas of the
tongue reflect the functioning of internal organs (Fig. 1).
Therefore, to detect possible pathological changes, it is possible
to analyze different areas of the tongue, such as the Tip,
Center, Root, and both the Left and Right edges. In each
area it is then possible to investigate several portions of pixels to
obtain a more in-depth sampling. Under the assumption of reasoning
on a single Region of Interest (ROI) for a single inspection area, it
is possible to use (14) and (15) to define a set of texture measures.
Starting from the co-occurrence matrix, it is possible to obtain the
spatial gray-tone dependency matrix $G = [p(i, j)]$, where each
$p(i, j)$ is evaluated as the mean of $p(i, j; d, \theta)$, i.e. the
probability of a pixel pair going from gray tone i to gray tone j at
distance d and in the direction specified by the angle $\theta$ [26].

Texture features are relevant in identifying the dusty coating of a
tongue. Visually, a dusty coating is uniform and smooth on the
surface, with slow transitions of gray tones. Distinct areas of the
tongue may differ in texture, but also in color: a dusty
coating often shows white and yellow colors. More formally,
given the spatial gray-tone dependency matrix and the number of quantized
gray levels of an image, $N_g$, texture features are analyzed by
evaluating specific parameters such as the angular second
moment, the contrast, the correlation, the variance and the
entropy.


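Angular Second Moment:

$ASM = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i, j)^2$ (15)

Contrast:

$Con = \sum_{n=0}^{N_g - 1} n^2 \left\{ \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i, j) \right\}_{|i - j| = n}$ (16)

Correlation:

$Corr = \frac{\sum_{i=1}^{N_g} \sum_{j=1}^{N_g} (i \, j) \, p(i, j) - \mu_x \mu_y}{\sigma_x \sigma_y}$ (17)

where $\mu_x$, $\mu_y$ represent the means of $p_x$ and $p_y$, and $\sigma_x$, $\sigma_y$ represent
the standard deviations of $p_x$ and $p_y$.

Variance:

$Var = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} (i - \mu_x)^2 \, p(i, j)$ (18)

Entropy:

$E = -\sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i, j) \log p(i, j)$ (19)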
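
The five measures (15)-(19) can be computed from a normalized gray-tone dependency matrix as in the following Python sketch. It is illustrative only: the function name is hypothetical, $p_x$ and $p_y$ are taken to be the row and column marginals of the matrix, gray levels are indexed from 0 (which leaves all five measures unchanged with respect to indexing from 1), the entropy is written with the usual negative sign, and the correlation is set to 0 when a standard deviation is zero, which is an assumption of this example.

```python
import math


def texture_measures(p):
    """ASM, contrast, correlation, variance and entropy of a normalized matrix p."""
    n = len(p)
    px = [sum(p[i][j] for j in range(n)) for i in range(n)]   # row marginal
    py = [sum(p[i][j] for i in range(n)) for j in range(n)]   # column marginal
    mu_x = sum(i * px[i] for i in range(n))
    mu_y = sum(j * py[j] for j in range(n))
    sd_x = math.sqrt(sum((i - mu_x) ** 2 * px[i] for i in range(n)))
    sd_y = math.sqrt(sum((j - mu_y) ** 2 * py[j] for j in range(n)))
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]

    asm = sum(v * v for _, _, v in cells)                       # (15)
    # Grouping the cells by |i - j| = n, as in (16), is the same as
    # weighting every cell by (i - j)^2.
    contrast = sum((i - j) ** 2 * v for i, j, v in cells)       # (16)
    cov = sum(i * j * v for i, j, v in cells) - mu_x * mu_y
    corr = cov / (sd_x * sd_y) if sd_x > 0 and sd_y > 0 else 0.0  # (17)
    var = sum((i - mu_x) ** 2 * v for i, _, v in cells)         # (18)
    entropy = -sum(v * math.log(v) for _, _, v in cells if v > 0)  # (19)
    return {"asm": asm, "contrast": contrast, "correlation": corr,
            "variance": var, "entropy": entropy}
```

A matrix whose mass lies entirely on the diagonal, as expected for a smooth coating with slow gray-tone transitions, yields zero contrast and a correlation of 1.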


Human perception thus detects the dusty coating by
observing the properties of a fine and invariant surface texture.
With Computerized Tongue Diagnosis Systems, instead, the
dusty coating can be detected through the smooth transitions of gray tones:
the pixels appear to be highly correlated and characterized by low
contrast values in shades of gray.


4.3 Color Features


Human diagnosis can be effectively supported by the automatic
analysis of images [28], which is in charge of extracting appropriate
features, such as color. Color features can be represented
in different color spaces [30]. A color space is a method by
which color can be specified and displayed. Typically, for automatic image
analysis, the most used color spaces include RGB, HSI, CIExy,
CIELUV and CIELAB.

Usually, a color is defined by means of three parameters
describing the position of the color within the adopted color
space. An image is commonly represented as a two-dimensional
pixel matrix in which each pixel is composed of three parameters,
i.e. red, green and blue (RGB). The RGB color space is often used
in computer applications, as no transformation is needed for
screen display. Once the parameters of a color in one space are known, its
representation in other color spaces can be obtained through
appropriate transformations; see [31] for further details.

As an example, the HSI color model uses hue, saturation and
intensity to describe the features inside the images. More
specifically, given a generic pixel or a ROI of an image, Hue (H)
is the color type of the pixel, Saturation (S) is the degree to which
a certain color is mixed with other colors, and Intensity (I) is the
brightness of the considered pixel.

The representation in the HSI space can be obtained from
the RGB one by applying the following transformations:


$H = \arccos\left\{ \frac{\frac{1}{2}\left[(R - G) + (R - B)\right]}{\left[(R - G)^2 + (R - B)(G - B)\right]^{1/2}} \right\}$ (20)

$S = 1 - \frac{3}{R + G + B} \min(R, G, B)$ (21)

$I = \frac{R + G + B}{3}$ (22)
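
The transformations (20)-(22) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, the inputs are assumed to be normalized to [0, 1], the hue is returned in degrees and, following the usual convention for this model, it is taken as 360 minus the arccos value when B > G. Hue (and saturation for black) is undefined for gray pixels and is set to 0 here by choice.

```python
import math


def rgb_to_hsi(r, g, b):
    """Convert normalized RGB values in [0, 1] to (H in degrees, S, I)."""
    i = (r + g + b) / 3                                   # (22)
    if i == 0:
        return 0.0, 0.0, 0.0    # black: H and S undefined, set to 0 by choice
    s = 1 - min(r, g, b) / i                              # (21), since 3/(R+G+B) = 1/I
    den = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    if den == 0:
        return 0.0, s, i        # gray: hue undefined, set to 0 by choice
    cos_h = 0.5 * ((r - g) + (r - b)) / den
    cos_h = max(-1.0, min(1.0, cos_h))                    # guard against rounding
    theta = math.degrees(math.acos(cos_h))                # (20)
    h = 360 - theta if b > g else theta                   # usual convention for B > G
    return h, s, i
```

Pure red, green and blue map to hues of 0, 120 and 240 degrees respectively, each with saturation 1 and intensity 1/3.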


The human eye perceives light through three types of cone cells
whose spectral sensitivity peaks lie at
short ("S", 420 nm - 440 nm), medium ("M",
530 nm - 540 nm) and long ("L", 560 nm - 580 nm) wavelengths. These
values represent the so-called tristimulus: this triple can represent
the quantities of three primary colors in a three-color additive
color model. The CIE XYZ color space represents color as
visible to the human eye, so CIE XYZ is a device-invariant
representation of color. When judging the brightness of different
colors, humans tend to perceive light within the green parts of the
spectrum as brighter than red or blue light of equal power.

The CIE model exploits this fact by setting Y as
luminance. Z is almost equal to blue, i.e. the response of the S
cones, and X is a mix of response curves chosen to be non-negative.
Setting Y as luminance has the useful result that, for each
given Y value, the XZ plane contains all the possible
chromaticities at that luminance.

It is well known that tongue color gives important information
about the possible presence of specific pathologies [15, 27].
Various approaches have been proposed to exploit color
features in medical tongue images. One of the most important
approaches has been proposed in [33, 34]. The technique defines a
kind of color codebook in which the colorimetric features of the
images associated with various diseases are mapped. In more
detail, color features are extracted from each pixel and assigned
to 1 of 12 colors symbolizing the tongue color gamut. The tongue
color gamut represents all possible visible colors on the tongue
surface. Pixels or ROIs of the tongue image are associated with a
vector composed of 12 features, which can be used in the
subsequent classification phase. Figure 7, taken from [33], shows
the tongue color gamut on the CIE chromaticity diagram: the red boundary
encloses all the points, and the black boundary encloses 98% of them.


Figure 7: The 100% and 98% tongue color gamut in CIE color space [33]


Starting from these considerations the colors defining the gamut 
have been identified (Fig.8). 


C (Cyan) R (Red) B (Blue) P (Purple) DR (Deep red) LR (Light red) 
LP (Light purple) LB (Light blue) BK (Black) GY (Gray) W (White) Y (Yellow) 


Figure 8: 12 colors representing the tongue color gamut [33] 
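
The assignment of pixels to the 12 gamut colors can be sketched as a nearest-color lookup followed by a normalized histogram. The Python code below is only an illustration: the RGB values in `CODEBOOK` are placeholders chosen for this example and are not the calibrated values of [33], and both the use of the Euclidean distance in RGB space and the choice of the fraction of pixels per color as the 12 features are assumptions.

```python
# Placeholder RGB values for the 12 gamut colors (NOT the values used in [33]).
CODEBOOK = {
    "C": (0, 255, 255), "R": (255, 0, 0), "B": (0, 0, 255),
    "P": (128, 0, 128), "DR": (139, 0, 0), "LR": (255, 128, 128),
    "LP": (200, 160, 220), "LB": (173, 216, 230), "BK": (0, 0, 0),
    "GY": (128, 128, 128), "W": (255, 255, 255), "Y": (255, 255, 0),
}


def nearest_color(pixel, codebook=CODEBOOK):
    """Name of the codebook color closest to the pixel (squared Euclidean distance)."""
    return min(
        codebook,
        key=lambda name: sum((p - c) ** 2 for p, c in zip(pixel, codebook[name])),
    )


def gamut_feature_vector(pixels, codebook=CODEBOOK):
    """Fraction of pixels assigned to each codebook color, in codebook order."""
    counts = dict.fromkeys(codebook, 0)
    for pixel in pixels:
        counts[nearest_color(pixel, codebook)] += 1
    return [counts[name] / len(pixels) for name in codebook]
```

The resulting vector has one entry per gamut color and sums to 1, so it can be passed directly to a classifier regardless of the size of the ROI.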


In addition to the approach based on the tongue color
gamut, it is possible to evaluate the mean and standard deviation
of a selected ROI. Some recent proposals have highlighted how
Multiple Instance Learning (MIL) approaches can obtain
interesting classification performance in applications on medical
images [35, 37]. In [36] the authors proposed a mixed-integer
nonlinear formulation solved with a MIL approach; the proposed
algorithm was applied to a set of color images (Red, Green, Blue,
RGB) with the objective of classifying the images containing
specific patterns. MIL is a recent
machine learning paradigm that is proving suitable for analyzing
medical images and videos. MIL algorithms detect relevant
patterns in images or videos based only on the class labels
globally assigned to the images or videos. Therefore, supervision is
based on global labels, and the training phase of MIL algorithms
does not require tedious manual segmentation. Proposals based
on MIL approaches are attracting increasing interest from the
Medical Image and Video Analysis (MIVA) community. The
need for tools that make it possible to construct predictive models
capturing disease progression is a priority [44]. For the same reason, great
attention is devoted to effective solutions ensuring accuracy in the
identification of groups of similar genes or patients [45] and
capturing valuable insights from medical sources [46].


5. Conclusion and Future Work 


In this paper, we focused on the features most commonly used to 
perform automatic analysis of tongue images. This work was 
motivated by the observation that, in many research works on 
image processing in the medical field, features are often left 
undefined and a clear formulation of the objective methods used 
is missing. Our aim was to contribute to tongue image analysis 
and classification through an in-depth analysis of relevant 
features. 

With appropriate features, approaches such as Artificial Neural 
Networks (ANN) and Support Vector Machines (SVM) can perform an 
effective automatic classification between images of healthy 
tongues and images of tongues with pathologies [38]. ANNs also 
appear useful for proposing more general models to map health 
data [39]. Reasoning about the most relevant and useful features 
for automatic classification methods is therefore a mandatory 
step. 

In computer vision systems, the evolution of camera technology, 
even on wearable devices, opens up the possibility of creating 
self-diagnosis systems. Many sophisticated systems based on image 
analysis have been proposed in the recent literature in different 
areas, such as health care [40, 41]. As future work, we aim to 
further investigate relevant features for automatic classification 
methods in specific domains, and to take advantage of machine 
learning techniques for effective classification. More 
specifically, we plan to perform a detailed comparison of the 
approaches in the literature and to provide insights on the best 
feature predictors. 


ABSTRACT 


Electronic medical record (EMR) systems have now been widely 
adopted to support medical workers. There has also been much 
interest in the machine-based generation of clinical pathways, 
which can use sequential pattern mining (SPM) to extract them from 
historical EMR data. However, the existing methods do not protect 
individual privacy, even though they involve sensitive medical 
data. To ensure the privacy of individual data, this paper 
describes two algorithms that achieve differential privacy by 
adding noise during the calculations of an SPM method that 
considers time intervals. The proposals limit the amount of added 
noise by adding noise to the frequency calculations of only a 
subset of the candidate closed sequences. Experiments on real 
medical datasets show that our proposal keeps the utility of the 
mining process high and robust even with a minimal privacy budget 
and a small amount of added noise. 


CCS CONCEPTS 


• Information systems → Data mining; • Security and privacy → 
Privacy protections. 


1 INTRODUCTION 


Electronic medical record (EMR) systems have now been adopted 
by most large-scale hospitals to support medical workers via simple 
record-keeping procedures. In addition, the secondary uses of EMR 
data have attracted attention with respect to the standardization 
of medical treatments. Medical workers tend to use clinical path- 
ways as guidelines for their medical actions. A clinical pathway 
defines the typical flow of medical treatment for each disease and 
is generated conventionally by medical workers drawing on their 
experience. 

Several methods have been proposed for the machine-based 
generation of clinical pathways; they use sequential pattern 
mining (SPM) to extract the pathways from historical EMR data 
[9, 10, 14]. In our previous work, we proposed a method called 
T-CSpan that extracts frequent sequential patterns from EMR 
systems to generate clinical pathways automatically while handling 
time intervals and the efficacy of medicines [10]. 

However, releasing clinical pathways risks compromising individual 
patient privacy. As new medicines or medical equipment are 
introduced, the typical clinical pathways may change as well. 
Therefore, malicious users who can obtain different outputs of the 
algorithm by changing the search condition may identify an 
individual's records. Some approaches to this issue have 
considered anonymizing the data before the mining operations 
[11, 15]. However, it has been shown that even anonymized data can 
become reidentifiable via public data or other data mining 
algorithms [7, 13]. 

An alternative approach proposes differential privacy as a way 
to address such issues efficiently [5]. It offers strong 
theoretical guarantees on the privacy of released data by adding a 
carefully chosen amount of noise to the analyzed results. 
Differential privacy ensures that the output of a computation is 
insensitive to a change in any individual's record, thereby 
restricting privacy leaks from the results. Although there are a 
number of existing works on differentially private SPM [2, 17], 
they consider only datasets of numerical data, which are much 
simpler than EMR data consisting of complex medical orders such as 
prescriptions, injections, and surgery. 

In this paper, we focus on the design of a differentially private 
frequent-sequence mining algorithm that considers time intervals 
for EMR data. Given a collection of input medical-order sequences, 
the algorithm aims to find those sequences that occur in the 
collection more frequently than a given threshold (minimum 
support, or MinSup). During the calculations that generate the 
frequent sequences, noise is carefully added with the aim of 
avoiding significant degradation of the algorithm's utility. We 
deploy differential privacy on T-CSpan because T-CSpan requires 
significantly fewer calculations than other relevant algorithms: 
it calculates only the frequencies of closed sequences [10]. 
Because a closed sequence has no super-sequence with the same 
frequency, all of its subsequences with the same frequency can be 
ignored during calculation. Moreover, only closed sequences whose 
true frequency is close to MinSup are considered as candidate 
sequences to which noise is added. Thus, the required amount of 
added noise can be markedly restrained. 
The contributions of this paper are as follows. 


• We propose two algorithms, T-CSpan-DP and an enhanced version 
T-CSpan-DPe, that provide an SPM algorithm for EMR systems 
which considers time intervals and satisfies differential 
privacy. 

• The proposed algorithms are evaluated using real medical 
data. The experimental results identify T-CSpan-DPe as the 
better algorithm with respect to high utility and a small 
amount of added noise. 

• It is demonstrated that the performance of T-CSpan-DPe is 
robust, as it is not affected by the parameters controlling the 
privacy budget and the noise used in the algorithms. As a 
result, the proposed algorithm has high potential for private 
typical-pathway generation from EMR systems because it can 
maintain high utility with a low privacy budget and a small 
amount of noise. 


The remainder of this paper is organized as follows. Related work 
is reviewed in Section 2. The key definitions and theorems of dif- 
ferential privacy are introduced in Section 3. The proposed method 
is described in Section 4. Experimental evaluation of the proposed 
method is discussed in Section 5. Conclusions and ideas for future 
work are summarized in Section 6. 


2 RELATED WORK 


This section gives a brief review of sequential pattern mining (SPM), 
and SPM with differential privacy. 


2.1 Sequential Pattern Mining 


A well-known SPM algorithm is the Apriori-based frequent 
pattern-mining algorithm [1]. However, it is very time-consuming with 
large data sets and generates many irrelevant patterns among its 
results. To exclude irrelevant patterns, PrefixSpan [8] was proposed 
to mine the complete set of patterns while reducing the effort of 
candidate pattern generation by exploring prefix projection. To 
improve efficiency further, CSpan [12] was proposed for mining 
closed sequential patterns. This algorithm uses a pruning method 
called occurrence checking that allows the early detection of closed 
sequential patterns during the mining. 

Initially, the method proposed by Agrawal et al. [1] did not 
consider the time interval between items. For example, if an 
injection was performed on January 1, 2019, the sequence in which 
surgery was performed the next day and the sequence in which 
surgery was performed three days after the injection were regarded 
as the same sequence. Chen et al. proposed a mining method called 
TI-SPM for sequences in which the time interval is important, such 
as medical instructions; it treats the above two sequences as 
different [4]. 

T-PrefixSpan [14] is a method to extract frequent sequential 
patterns from EMR logs that considers time intervals and the 
efficacy of medicines. Because time intervals are incorporated, 
T-PrefixSpan is functionally rich: it can offer medical workers 
more valuable information about the minimum, maximum, average, 
median, and most frequent value of the time intervals between 
successive medical treatments. T-CSpan [10] further improves the 
speed by applying the idea of mining only closed sequences. 


2.2 Differentially Private Sequential Pattern Mining 


Differential privacy is now considered the standard approach to 
private data analysis [5, 6]. It has been shown to be resistant to 
composition attacks, in which an adversary uses independent 
anonymized data to breach privacy [7]. Detailed statistical 
aspects of differential privacy have also been studied [16]. 

There are several works on differentially private 
frequent-sequence mining. Bonomi et al. [2] propose a two-phase 
differentially private algorithm for mining frequent 
consecutive-item sequences. The first phase uses a prefix tree to 
find candidate sequences; the second leverages a 
database-transformation technique to refine the support of the 
candidate sequences. Xu et al. [17] use sample databases to 
estimate the sequences that are potentially frequent, and thereby 
reduce the number of candidate sequences. However, all existing 
works consider only simple datasets of numerical data. In our 
work, the dataset contains longer sequences of more complex items, 
such as prescriptions and surgical procedures. Moreover, existing 
works have focused on Top-k mining, in which the number of output 
results is fixed, so the amount of added noise can be easily 
controlled. In our work, however, the number of output results 
varies and depends on a predefined threshold MinSup. The extracted 
sequences have frequencies greater than MinSup. Therefore, 
increasing MinSup decreases the number of extracted sequences, and 
vice versa. 

Our work is based on T-CSpan [10], which is the fastest algorithm 
for generating clinical pathways that considers the time interval 
between successive items in EMRs. T-CSpan mines closed sequen- 
tial patterns using an occurrence-checking method that excludes 
duplicated sequences in the mining process. 


3 PRELIMINARIES 


Differential privacy has become a de facto standard for privacy 
considerations in private data analysis [5, 6]. Formally, differential 
privacy is defined as follows. 


DEFINITION 1. $\epsilon$-differential privacy 
A private algorithm $M$ satisfies $\epsilon$-differential privacy if and 
only if, for any databases $D_1$ and $D_2$ that differ by at most one 
record and for any subset of outputs $S$, 

$$\Pr[M(D_1) \in S] \le \exp(\epsilon) \times \Pr[M(D_2) \in S],$$ 

where the probability is taken over the randomness of $M$. 

An essential concept for guaranteeing differential privacy is 
the sensitivity. It measures the maximum change in the output of 
a function when any individual's record in the database is 
changed. 


DEFINITION 2. Sensitivity $\Delta f$ 
Given any function $f : D \to \mathbb{R}^n$, for any databases $D_1$ and 
$D_2$ that differ by at most one record, the sensitivity of $f$ is 

$$\Delta f = \max_{D_1, D_2} \lVert f(D_1) - f(D_2) \rVert_1.$$ 


For example, for the function of counting the male students in a 
class, the sensitivity becomes 1. 

The Laplace mechanism was proposed in [5], where it is proven 
that differential privacy can be achieved by adding noise drawn 
randomly from a Laplace distribution. 


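THEOREM 1. Laplace mechanism 
For any function $f : D \to \mathbb{R}^n$ with sensitivity $\Delta f$, the 
algorithm 

$$M(D) = f(D) + Lap(\lambda)$$ 

satisfies $\epsilon$-differential privacy, where $Lap(\lambda)$ follows the 
probability density function 
$\Pr[x \mid \lambda] = \frac{1}{2\lambda} \exp\left(-\frac{|x|}{\lambda}\right)$, with $\lambda = \frac{\Delta f}{\epsilon}$. 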
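
As a minimal sketch of the Laplace mechanism stated above (not the authors' Java implementation), the following Python code draws Laplace noise with scale $\Delta f / \epsilon$ and adds it to a true value. The function names are hypothetical. The sampler relies on the fact that the difference of two independent unit-rate exponential variables follows a standard Laplace distribution.

```python
import random


def laplace_noise(scale, rng=random):
    """Draw one sample from Lap(scale).

    The difference of two independent Exp(1) samples is Laplace(0, 1),
    so scaling it gives Laplace(0, scale).
    """
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Return true_value + Lap(sensitivity / epsilon)."""
    return true_value + laplace_noise(sensitivity / epsilon, rng)
```

For a counting query the sensitivity is 1, so a call such as `laplace_mechanism(count, 1, 0.01)` perturbs the count with noise of scale 100; a smaller $\epsilon$ gives stronger privacy and larger noise.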

Sequential composition and parallel composition are used to 
support multiple differentially private calculations [6]. 


THEOREM 2. Sequential composition 
Let $M_1, M_2, \ldots, M_k$ be $k$ algorithms, each providing 
$\epsilon_i$-differential privacy. A sequence of algorithms $M_i(D)$ over 
database $D$ provides $(\sum_i \epsilon_i)$-differential privacy 
($i = 1, \ldots, k$). 


THEOREM 3. Parallel composition 
Let $M_1, M_2, \ldots, M_k$ be $k$ algorithms, each providing 
$\epsilon_i$-differential privacy. A sequence of algorithms $M_i(D_i)$ over 
disjoint databases $D_1, \ldots, D_k$ provides $\max(\epsilon_i)$-differential 
privacy ($i = 1, \ldots, k$). 
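
The two composition theorems amount to simple budget accounting, which the following sketch makes explicit. The helper names are hypothetical and serve only to illustrate the arithmetic.

```python
def sequential_budget(epsilons):
    """Total budget when all mechanisms run on the same database."""
    return sum(epsilons)


def parallel_budget(epsilons):
    """Total budget when each mechanism runs on a disjoint database."""
    return max(epsilons)
```

For instance, five mechanisms that each spend a budget of 0.01 on the same database consume 0.05 in total, whereas on disjoint databases they consume only 0.01.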


4 PROPOSED METHODS 


First, we define the concepts necessary for the introduction of the 
proposed algorithms. We then explain the algorithms that extract 
typical sequences while achieving differential privacy. The 
handling of medicines and their efficacy, together with other 
details, is discussed in [10]. 


4.1 Handling Time Intervals between Items 


DEFINITION 3. T-item $(i, t)$ 
Let $I$ be a set of items, $i \in I$ an item, and $t$ the time at which 
item $i$ occurs. The T-item $(i, t)$ is defined as the pair comprising 
$i$ and $t$. 


DEFINITION 4. T-sequence $s$ and O-sequence $O_s$ 
Let T-sequence $s$ be a sequence of T-items such that 

$$s = \langle (i_1, t_1), (i_2, t_2), \ldots, (i_n, t_n) \rangle.$$ 

T-items that occur at the same time shall be arranged in dictionary 
order. Furthermore, if $n$ is the length of T-sequence $s$, the 
O-sequence of $s$ is defined as the sequence 
$O_s = \langle i_1, i_2, \ldots, i_n \rangle$. 


DEFINITION 5. Time interval $TI_k$ 
Given a T-sequence $s = \langle (i_1, t_1), (i_2, t_2), \ldots, (i_n, t_n) \rangle$, 
let the time interval $TI_k$ be defined as 

$$TI_k = t_{k+1} - t_k \quad (k = 1, 2, \ldots, n-2, n-1).$$ 


DEFINITION 6. T-sequential database $D$ and O-sequential database 
$O_D$ 
Given a set $S$ of T-sequences, a T-sequential database $D$ is defined 
as 

$$D = \{ (s_{id}, s) \mid s \in S \},$$ 

where the identifier $s_{id}$ for the elements of $D$ is unique to each 
sequence. Furthermore, let an O-sequential database $O_D$ be a 
sequential database comprising the O-sequences configured from all 
T-sequences in $D$. Let $Size(D)$ be $Size(O_D)$, i.e., the number of 
sequences in $O_D$. 
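
Definitions 3 to 6 can be illustrated with a small Python sketch. The representation is an assumption made for illustration: a T-sequence is a list of `(item, time)` pairs already sorted by time, and a T-sequential database is a dictionary from sequence identifier to T-sequence. All function names are hypothetical.

```python
def o_sequence(t_sequence):
    """Drop the time stamps: [(item, time), ...] -> [item, ...]."""
    return [item for item, _ in t_sequence]


def time_intervals(t_sequence):
    """TI_k = t_{k+1} - t_k for each pair of consecutive T-items."""
    times = [t for _, t in t_sequence]
    return [later - earlier for earlier, later in zip(times, times[1:])]


def o_database(t_database):
    """Build the O-sequential database from a T-sequential database."""
    return {sid: o_sequence(seq) for sid, seq in t_database.items()}
```

For the T-sequence `[("a", 1), ("b", 3), ("c", 7), ("d", 10)]`, the O-sequence is `["a", "b", "c", "d"]` and the time intervals are `[2, 4, 3]`.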


DEFINITION 7. T-frequent sequential pattern $P$ 
Let $MinSup$ ($0 \le MinSup < 1$) be a minimum support and $D$ be a 
T-sequential database. Given 
$P = \langle i_1, X_1, i_2, X_2, \ldots, i_{n-1}, X_{n-1}, i_n \rangle$ 
(where each $i_j$ is an item and each $X_k$ is the set of five values 
$min_k$, $mod_k$, $ave_k$, $med_k$, and $max_k$), a sequence 
$O_P = \langle i_1, i_2, \ldots, i_{n-1}, i_n \rangle$ can be configured. 
$P$ is defined as a T-frequent sequential pattern if $O_P$ is a frequent 
sequential pattern in the O-sequential database configured from $D$, 
i.e., 

$$Sup(P) = |\{ Seq \mid O_P \subseteq Seq, (s_{id}, Seq) \in O_D \}| \ge Size(O_D) \times MinSup,$$ 

where $s_{id}$ is the identifier of $Seq$. 
Let $O_P$ be the O-Pattern of $P$. The set of five values is defined as 
a result of the following considerations. 
Given all T-sequences in $D$ whose O-sequences contain $O_P$, let $S$ be 
one such T-sequence, with 
$S = \langle (i'_1, t'_1), (i'_2, t'_2), \ldots, (i'_m, t'_m) \rangle$. 
By using $j_1, j_2, \ldots, j_{n-1}, j_n$, which fulfil both 
$1 \le j_1 < j_2 < \ldots < j_{n-1} < j_n \le m$ and 
$i_k = i'_{j_k}$, $i_{k+1} = i'_{j_{k+1}}$, sets of time intervals can be 
configured: $Set_{TI_1}, Set_{TI_2}, \ldots, Set_{TI_{n-1}}$, where 
$TI_k = t'_{j_{k+1}} - t'_{j_k}$. 
In $X_k = (min_k, mod_k, ave_k, med_k, max_k)$, the five values are 
defined as 
(1) $min_k = \min Set_{TI_k}$ 
(2) $mod_k$ = the most frequent value in $Set_{TI_k}$ 
(3) $ave_k$ = the average of the values in $Set_{TI_k}$ 
(4) $med_k$ = the median of the values in $Set_{TI_k}$ 
(5) $max_k = \max Set_{TI_k}$ 

DEFINITION 8. T-closed frequent sequential pattern $A$ 
Given a T-sequential database $D$, let $\Sigma$ be the set of T-frequent 
sequential patterns extracted from $D$ and let $A$ be a T-frequent 
sequential pattern in $\Sigma$. $A$ is a T-closed frequent sequential 
pattern if there is no $B$ in $\Sigma \setminus A$ such that 
(1) If $A'$ and $B'$ are the O-Patterns of $A$ and $B$, respectively, then 
$A' \subseteq B'$. 
(2) $Sup(A) \le Sup(B)$, where we define the support of a T-frequent 
sequential pattern $A$ as 
$Sup(A) = |\{ s \mid s \subseteq S, (S_{id}, S) \in D \}|$ and $S_{id}$ is the 
identifier of $S$ in $D$. 
(3) If $A$ and $B$ are 
$\langle a_1, T_1, a_2, T_2, \ldots, a_{n-1}, T_{n-1}, a_n \rangle$ and 
$\langle b_1, T'_1, b_2, T'_2, \ldots, b_{m-1}, T'_{m-1}, b_m \rangle$, 
respectively, then $\langle j_1, j_2, \ldots, j_n \rangle$ that meets 
$1 \le j_1 < j_2 < \ldots < j_n \le m$ and 
$a_k = b_{j_k}$, $a_{k+1} = b_{j_{k+1}}$ exists. 


Table 1: T-sequential database D 

Sequence identifier    T-sequence 
s1                     < (a, 1), (b, 3), (c, 7), (d, 10) > 
s2                     < (a, 1), (b, 4), (d, 7) > 
s3                     < (a, 2), (b, 6), (b, 9) > 
s4                     < (a, 2), (b, 5) > 
s5                     < (a, 2), (b, 7) > 


Table 2: $O_D$ (O-sequential database of D) 

Sequence identifier    O-sequence 
s1                     < a, b, c, d > 
s2                     < a, b, d > 
s3                     < a, b, b > 
s4                     < a, b > 
s5                     < a, b > 


For example, consider extracting T-frequent sequential patterns 
from the T-sequential database D described in Table 1 under the 
minimum support MinSup = 0.4. The O-sequential database of D is 
shown in Table 2. The frequent sequential patterns under the 
minimum support MinSup = 0.4 are < a >, < b >, < d >, < a, b >, 
< b, d >, and < a, b, d >. Because the frequent sequential 
patterns that have one item in $O_D$ are T-frequent sequential 
patterns in D, < a >, < b >, and < d > are T-frequent sequential 
patterns in D. Considering the time between item a and item b in 
the sequence < a, b >, the set of time intervals calculated from D 
is {2, 3, 3, 4, 5}. Considering the minimum, the most frequent 
value, the average, the median, and the maximum, 
< a, (2, 3, 3, 3, 5), b > is a T-frequent sequential pattern in D 
(2 + 3 + 3 + 4 + 5 = 17, ⌊17/5⌋ = 3). Similarly, the T-frequent 
sequential patterns calculated from < b, d > and < a, b, d > are 
< b, (3, 5, 5, 5, 7), d > and 
< a, (2, 2, 2, 2, 3), b, (3, 5, 5, 5, 7), d >. If two or more 
different values are tied as the most frequent value, their 
average is taken as the most frequent value. Therefore, the 
T-frequent sequential patterns in D under the minimum support 
MinSup = 0.4 are < a >, < b >, < d >, < a, (2, 3, 3, 3, 5), b >, 
< b, (3, 5, 5, 5, 7), d >, and 
< a, (2, 2, 2, 2, 3), b, (3, 5, 5, 5, 7), d >, and the T-closed 
frequent sequential patterns are < a >, < b >, 
< a, (2, 3, 3, 3, 5), b >, and 
< a, (2, 2, 2, 2, 3), b, (3, 5, 5, 5, 7), d >. 
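
The five interval statistics used in the example above can be reproduced with the short sketch below. Two points are assumptions inferred from the worked values rather than stated rules: all fractional results are rounded down, and ties for the most frequent value are resolved by averaging the tied values. The function name is hypothetical.

```python
import math
from collections import Counter
from statistics import median


def summarize_intervals(intervals):
    """Return (min, mod, ave, med, max) for a non-empty list of intervals."""
    counts = Counter(intervals)
    top = max(counts.values())
    tied = [value for value, count in counts.items() if count == top]
    mod = math.floor(sum(tied) / len(tied))        # ties: average of tied values
    ave = math.floor(sum(intervals) / len(intervals))
    med = math.floor(median(intervals))
    return (min(intervals), mod, ave, med, max(intervals))
```

Applied to the intervals between a and b in Table 1, `summarize_intervals([2, 3, 4, 3, 5])` gives `(2, 3, 3, 3, 5)`, and the two intervals between b and d, `[7, 3]`, give `(3, 5, 5, 5, 7)`.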


4.2 T-CSpan-DP Algorithm 


T-CSpan-DP is given as Algorithm 1, in which the O-sequential 
database of a T-sequential database $D$ is $O_D$, the O-sequence of a 
T-sequence $S$ is $O_S$, and the concatenation of a sequence $A$ with a 
sequence $B$ is $AB$. We denote the $N$-th element of a set $X$ as $X_N$, 
the $N$-th item of a sequence $S$ as $S_N$, and the time of occurrence of 
the $N$-th T-item of a T-sequence $A$ as $T_{A_N}$. 

In this paper, we represent an item in the form of a set of four 
segments of text (Class; Description; Code; Name). Class is the type 
of medical treatment, Description is the detailed record of the treat- 
ment, Code is a medicinal code representing the unique efficacy 
of the medicine used, and Name is the name of the medicine. An 
example of such an item is (prescription; internal medicine; 613; Fine 
Cefzon 10%). We then configure a T-sequence that comprises the 
T-items undergone by one patient until discharge from hospital 
and construct a T-sequential database from these T-sequences. 

The process in T-CSpan-DP is efficient because it uses an 
occurrence check to add only the closed T-sequences to the result. 
A closed T-frequent sequential pattern is a frequent sequence that 
has no super-sequence with the same support. Therefore, only the 
longest frequent T-sequences need to be added, and any 
subsequences of them are ignored unless they occur more frequently 
than the super-sequence. 

To meet the differential privacy requirement, noise is added to 
the true frequency of each generated closed T-sequence, as 
specified in line 5 of Algorithm 2. The noise is drawn from a 
Laplace distribution. Furthermore, to minimize the added noise, we 
add noise only to closed sequences whose frequency is close to 
MinSup. The underlying idea is that, because noise is added to the 
true frequencies, some truly frequent sequences may be excluded 
from the output. Hence, sequences whose true frequencies are close 
to MinSup are also taken into consideration. The number of such 
sequences is controlled via the $\beta$ parameter at line 4 of 
Algorithm 2. 
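
The noisy threshold test described above (lines 4 to 6 of Algorithm 2) can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes Laplace noise of scale $1/\epsilon$ (sensitivity 1), omits the check on the set of frequent extension items, and uses a hypothetical function name.

```python
import random


def passes_noisy_threshold(sup, size, min_sup, epsilon, beta, rng=random):
    """Decide whether a sequence with true support `sup` is kept.

    size:    number of sequences in the (projected) database
    min_sup: minimum support ratio (MinSup)
    beta:    range parameter; widens the candidate set below the threshold
    """
    threshold = size * min_sup
    if sup < threshold * (1.0 - beta):
        return False  # too far below MinSup: not a candidate, no noise added
    # Assumption: Lap(1 / epsilon), sampled as a difference of exponentials.
    noise = (rng.expovariate(1.0) - rng.expovariate(1.0)) / epsilon
    return sup + noise >= threshold
```

With $\beta = 0$ only sequences that already meet MinSup are perturbed; a larger $\beta$ also gives sequences slightly below the threshold a chance to pass after noise is added.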


THEOREM 4. Differential privacy 
T-CSpan-DP satisfies $(l \times \epsilon)$-differential privacy, where $l$ is 
the number of times Algorithm 2 is invoked. 


Proof: In Algorithm 2, the sensitivity of computing the frequency 
of a sequence is 1. Adding noise $Lap(1/\epsilon)$ when computing the 
sequence's frequency in Algorithm 2 therefore satisfies 
$\epsilon$-differential privacy. Because Algorithm 2 is invoked $l$ times 
from Algorithm 1 (lines 7 to 9), the sequential composition 
theorem (Theorem 2) implies that T-CSpan-DP satisfies 
$(l \times \epsilon)$-differential privacy. $l$ is proportional to the number 
of single frequent items and depends on the predefined MinSup: 
decreasing MinSup increases $l$ because more items are considered 
frequent. 


4.3 T-CSpan-DPe Algorithm 


Because T-CSpan-DP satisfies only $(l \times \epsilon)$-differential privacy, 
it remains inefficient: the total added noise may be large. We 
therefore propose an enhanced algorithm called T-CSpan-DPe. 
Instead of adding noise while the closed sequences are being 
generated (as Algorithm 2 does), T-CSpan-DPe first generates the 
closed sequences without noise (Algorithm 4) and adds noise only 
afterwards (Algorithm 3, lines 10 to 13). As in T-CSpan-DP, the 
number of candidate sequences is adjusted using the parameter 
$\beta$. 


THEOREM 5. Differential privacy 
T-CSpan-DPe satisfies $\epsilon$-differential privacy. 


Proof: The sensitivity of computing the frequency of a sequence 
is 1. Adding noise $Lap(1/\epsilon)$ to the closed sequence's frequency in 
Algorithm 3 therefore makes T-CSpan-DPe satisfy $\epsilon$-differential 
privacy. 


4.4 Applying the Proposed Methods to Other Systems 


Although our proposed algorithms focus on EMR data, they can also 
be extended to other types of systems that contain private data 
and in which time intervals are important. For example, the 
proposed methods can be applied to medical insurance claim data, 
which comprise the specifications of medical fees charged to 
health insurers, so that prescription trends can be analyzed while 
individual privacy is guaranteed. The proposed methods can also be 
applied to retail data, where the habits, interests, and shopping 
times of clients can be studied, or to traveler records at travel 
agencies, from which both the places visited and the time a 
traveler spent at those places can be derived. 


Algorithm 1: T-CSpan-DP 

Input : D: a T-sequential database, seq: a sequence, 
        MinSup: a minimum support, ε: privacy budget, 
        β: range parameter 
Output: P: the set of T-frequent sequential patterns with noise 

 1 begin 
 2   D'|seq = O_(D|seq); 
 3   if seq != null then 
 4     P ← GetProperTime(seq, D|seq, D'|seq); 
 5   SingleFreqItems ← {sfi | (s ∈ D'|seq, sfi ∈ s) ∧ 
                         (Sup(sfi) ≥ Size(D) × MinSup)}; 
 6   CSPseq ← ∅; 
 7   for sfi ∈ SingleFreqItems do 
 8     CSseq ← GenClosedSeqs-DP(D|sfi, sfi, MinSup, ε, β); 
 9     CSPseq ← CSPseq ∪ CSseq; 
10   for csp ∈ CSPseq do 
11     T-CSpan-DP(csp, D|CSPseq, MinSup, ε, β); 
12 Function GetProperTime(seq, D|seq, D'|seq) 
13   if length(seq) == 1 then 
14     return seq; 
15   K ← {k | <s_id, s> ∈ D|seq, O_s ∈ D'|seq, k ⊆ s, O_k == seq}; 
16   T = (∅, ∅, ..., ∅) (|T| = length(seq) - 1); 
17   for k ∈ K do 
18     for i = 0, ..., length(k - 1) do 
19       T_i ← T(k_(i+1)) - T(k_i); 
20   W = <seq_0, seq_1, ..., seq_(length(seq)-1)>; 
21   for i = 0, ..., length(seq) - 2 do 
22     T_i = an arbitrary function to exclude outliers from T_i; 
23     min_i = min T_i; 
24     mod_i = the most frequent value of T_i; 
25     ave_i = the average of the values of T_i; 
26     med_i = the median of T_i; 
27     max_i = max T_i; 
28     X_i = (min_i, mod_i, ave_i, med_i, max_i); 
29     W = <seq_0, ..., seq_i, X_i, seq_(i+1), ..., seq_(length(seq)-1)>; 
30   return W; 


5 EXPERIMENTAL EVALUATION 


To the best of our knowledge, T-CSpan-DP and T-CSpan-DPe are the 
first algorithms that mine medical-order sequences to generate 
typical pathways with time intervals while guaranteeing 
differential privacy. We performed experiments to evaluate the two 
algorithms with respect to the amount of added noise and utility, 
using real datasets with a variety of sizes and data 
distributions. We then investigated the effects of $\epsilon$ and $\beta$ on 
the algorithm's utility. 


Algorithm 2: GenClosedSeqs-DP 

Input : PD: a projected T-sequential database, seq: a T-sequence, 
        MinSup: a minimum support, ε: privacy budget, 
        β: range parameter 
Output: CS: noise-added closed T-sequences with prefix seq 

 1 begin 
 2   CSseq ← ∅; 
 3   F ← {f : (s ∈ PD, f ∈ s) ∧ (Sup(f) ≥ Size(PD) × MinSup)}; 
 4   if (Sup(seq) ≥ Size(PD) × MinSup × (1 - β)) ∧ (F ≠ ∅) then 
 5     noiseSup(seq) = Sup(seq) + Lap(1/ε); 
 6     if noiseSup(seq) ≥ Size(PD) × MinSup then 
 7       if seq.closed() == true then 
 8         CSseq ← seq; 
 9       for f ∈ F do 
10         CSseq ← GenClosedSeqs-DP(PD, seq + f, MinSup) 
           // seq + f: append f to seq 

Algorithm 3: T-CSpan-DPe 

Input : D: a T-sequential database, seq: a sequence, 
        MinSup: a minimum support, ε: privacy budget, 
        β: range parameter 
Output: P: the set of T-frequent sequential patterns with noise 

 1 begin 
 2   D'|seq = O_(D|seq); 
 3   if seq != null then 
 4     P ← GetProperTime(seq, D|seq, D'|seq); 
 5   SingleFreqItems ← {sfi | (s ∈ D'|seq, sfi ∈ s) ∧ 
                         (Sup(sfi) ≥ Size(D) × MinSup × (1 - β))}; 
 6   CSPseq ← ∅; 
 7   for sfi ∈ SingleFreqItems do 
 8     CSseq ← GenClosedSeqs(D|sfi, sfi, MinSup, β); 
 9     CSPseq ← CSPseq ∪ CSseq; 
10   for csp ∈ CSPseq do 
11     noiseSup(csp) = Sup(csp) + Lap(1/ε); 
12     if noiseSup(csp) ≥ Size(D) × MinSup then 
13       T-CSpan-DPe(csp, D|CSPseq, MinSup, ε, β); 


5.1 Experimental Environment and Method 


The algorithms were implemented in Java and evaluated using 
various values of the minimum support (MinSup), $\epsilon$, and $\beta$. 
Experiments were conducted on a Linux server with an Intel Xeon 
CPU E5-4650 at 2.70 GHz and 64 GB RAM. 


Algorithm 4: GenClosedSeqs 

Input : PD: a projected T-sequential database, seq: a T-sequence, 
        MinSup: a minimum support, β: range parameter 
Output: CS: closed T-sequences with prefix seq 

 1 begin 
 2   CSseq ← ∅; 
 3   F ← {f : (s ∈ PD, f ∈ s) ∧ (Sup(f) ≥ Size(PD) × MinSup)}; 
 4   if (Sup(seq) ≥ Size(PD) × MinSup × (1 - β)) ∧ (F ≠ ∅) then 
 5     if seq.closed() == true then 
 6       CSseq ← seq; 
 7     for f ∈ F do 
 8       CSseq ← GenClosedSeqs(PD, seq + f, MinSup) 
         // seq + f: append f to seq 


The target clinical pathway data were medical treatment data 
recorded between November 19, 1991 and October 4, 2015 in the 
EMR of the Faculty of Medicine at the University of Miyazaki 
Hospital. The target data for our experiments involved medical 
treatment pathways for Cryptorchidism Fusion Surgery (CFS) and 
Transurethral Resection of a Bladder tumor (Tur-bt). We chose 
these two clinical pathways because CFS is representative of 
clinical pathways for which the flow of medical treatments is 
fixed, whereas Tur-bt is a clinical pathway for which the flow is 
not well defined. The characteristics of the two datasets are 
given in Table 3. 


Table 3: Characteristics of the two datasets 

Dataset                    CFS      Tur-bt 
The number of sequences    271      514 
The average length         27.39    86.5 
The minimum length         10       11 
The maximum length         549      1999 

Utility Measures: To calculate the utility of the algorithm, we 
adopted the F-score metric, which is widely used for private data 
analysis [2, 3, 17, 18]. For each MinSup setting, we extracted the 
Original Results (OR) as the set of the longest frequent 
sequences, i.e., those having the highest number of items. These 
sequences were generated without adding noise. We generated the 
Noised Results (NR) as the sets of longest frequent sequences 
generated by the proposed algorithm for various $\epsilon$ and $\beta$ 
settings. We compared each Noised Results set with the Original 
Results by calculating its F-score. A higher F-score means higher 
utility. We selected the longest frequent sequences because 
feedback from medical workers indicated that they would contain 
all items needed for typical pathways. 


DEFINITION 9. F-score 
The F-score is calculated from recall and precision, defined as 
follows. Suppose that $OR = \{ s_i \mid i = 1, \ldots, n \}$ and 
$NR = \{ s_j \mid j = 1, \ldots, m \}$; then 

$$\mathit{recall} = \frac{\sum_{i=1}^{n} \frac{LCM(s_i, s_j)}{s_i.\mathit{length}}}{\#\mathit{common\_sequences}},$$ 

$$\mathit{precision} = \frac{\sum_{i=1}^{n} \frac{LCM(s_i, s_j)}{s_j.\mathit{length}}}{\#\mathit{common\_sequences}},$$ 

and the F-score is the harmonic mean of recall and precision: 

$$\mathit{F\text{-}score} = 2 \times \frac{\mathit{recall} \times \mathit{precision}}{\mathit{recall} + \mathit{precision}}.$$ 

Here, $LCM(s_i, s_j)$ is the length of the longest common sequence of 
the two sequences $s_i$ and $s_j$, and $\#\mathit{common\_sequences}$ is the total 
number of longest common sequences in the comparison. 
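
A minimal sketch of this utility measure is given below. It rests on two assumptions that the definition leaves open: $LCM$ is taken to be the length of the longest common subsequence, and the caller supplies the list of (original, noised) sequence pairs that are compared, so that the number of pairs plays the role of the number of common sequences. Sequences are assumed to be non-empty, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def f_score(pairs):
    """pairs: non-empty list of (original_seq, noised_seq) to compare."""
    n = len(pairs)
    recall = sum(lcs_length(o, r) / len(o) for o, r in pairs) / n
    precision = sum(lcs_length(o, r) / len(r) for o, r in pairs) / n
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)
```

For example, comparing the original sequence `"abcd"` with the noised sequence `"abd"` gives a recall of 3/4 and a precision of 1, hence an F-score of about 0.857.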


5.2 Experimental Results and Discussion 


We conducted several experiments to evaluate the two proposed 
algorithms, T-CSpan-DP and T-CSpan-DPe, described in Section 4, 
and to evaluate the effect of the parameters $\epsilon$ and $\beta$ on the 
performance of the proposed algorithms. We set MinSup to 0.1, 
0.08, 0.06, and 0.05 for the CFS dataset, and to 0.3, 0.25, 0.2, 
and 0.15 for the Tur-bt dataset. In general, the smaller the 
MinSup, the larger the output of the SPM algorithm, because more 
sequences can be extracted as frequent patterns. 


5.2.1 Evaluation of Two Proposed Algorithms. Figures 1 and 2 show 
the F-score results of the two proposed algorithms for the two 
datasets, with varying $\epsilon$ budgets. A larger $\epsilon$ indicates a smaller 
gap between the true and the noisy frequency values. $\beta$ was fixed 
to 0.0. 

It is observed that T-CSpan-DPe achieved better F-scores than 
T-CSpan-DP in almost all cases. In particular, T-CSpan-DPe 
improved the F-score by up to 32 points, corresponding to 
approximately 50%, for CFS with MinSup = 0.08 and $\epsilon$ = 0.01 
(0.95 vs. 0.63). The only exception is the Tur-bt dataset with 
MinSup = 0.3 (Figure 2a), where T-CSpan-DPe delivered slightly 
worse results; however, the gap is small, i.e., 3 points. 

Figures 3 and 4 present the absolute noise value added during the
computation of the two proposed methods for the two datasets. In
all cases, more noise is added in T-CSpan-DP than in
T-CSpan-DPe. This is understandable because, in T-CSpan-DP,
the number of noise additions is proportional to the number of
frequent items (Algorithm 1, lines 7 to 9), whereas in T-CSpan-DPe
noise is added only at the final step, i.e. after obtaining all
the frequent closed sequences (Algorithm 3, lines 11 to 13).

Moreover, when MinSup decreases, the amount of noise increases,
as more sequences can be considered as frequent patterns.
For each MinSup, increasing ε decreases the amount of added noise
in both algorithms because the range of noise values becomes
narrower.
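The dependence of the noise magnitude on ε can be illustrated with a small sketch of the Laplace mechanism. This is not the code of Algorithm 1 or Algorithm 3; it assumes a sensitivity of 1 and draws Laplace noise as the difference of two exponential variates. The function names are hypothetical.

```python
import random


def laplace_noise(epsilon, sensitivity=1.0, rng=random):
    """One sample from Laplace(0, sensitivity / epsilon).

    The difference of two i.i.d. exponential variates with mean b
    follows a Laplace distribution with scale b.
    """
    scale = sensitivity / epsilon
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def mean_abs_noise(epsilon, n_counts, seed=0):
    """Average absolute noise over n_counts noisy counts (sensitivity 1 assumed)."""
    rng = random.Random(seed)
    total = sum(abs(laplace_noise(epsilon, rng=rng)) for _ in range(n_counts))
    return total / n_counts
```

Under these assumptions the expected absolute noise per count equals 1/ε, so moving from ε = 0.01 to ε = 0.05 reduces it by a factor of five, and the total grows with the number of counts that receive noise.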

From the perspective of utility and amount of noise,
T-CSpan-DPe delivered better performance. Even for the Tur-bt dataset
with MinSup = 0.3, where the F-score was slightly worse, the amount
of added noise was significantly smaller (about 43% of that of
T-CSpan-DP).


Figure 4: Amount of added noise for the Tur-bt dataset


5.2.2 Evaluation of Parameters. As T-CSpan-DPe gained better
performance than T-CSpan-DP, in this section we focus on evaluating
the effect of parameters ε and β on the performance of
T-CSpan-DPe. We changed the value of β to 0.0, 0.1 and 0.2. The
higher the value of β, the more sequences can be considered as
frequent patterns.

Figures 5 and 6 show the F-score results of T-CSpan-DPe. From
these figures, it is observed that T-CSpan-DPe is robust, as the results
are stable under variations of ε and β. In all settings of ε, the
F-score results are almost the same. Increasing β slightly increases or
decreases the results; however, the gap is not large.
The biggest difference was 7 points for the CFS dataset with MinSup = 0.08,
ε = 0.01 (Figure 5b).

Figures 7 and 8 present the amount of noise added during the
calculation of T-CSpan-DPe. It is seen that the amount of added noise
decreases as ε increases, which is similar to the results
in Section 5.2.1. However, more noise is added with larger β, as
more sequences are counted. As a result, even minimum ε and β
are able to provide high utility.

Figure 8: Amount of added noise of T-CSpan-DPe for Tur-bt dataset


IDEAS2019 - 23rd International Database Engineering & Applications Symposium




Differentially Private Sequential Pattern Mining considering Time Interval for 


Electronic Medical Record Systems 


5.3 Experimental Evaluation Remarks 


We evaluated the two proposed algorithms, i.e. T-CSpan-DP and
T-CSpan-DPe, using real data from an EMR system. Although both
algorithms limit the addition of noise to only a part of the closed sequences,
the experimental results show that T-CSpan-DPe is superior to
T-CSpan-DP, as it gained up to 50% better utility with a
smaller amount of added noise. The reason is that T-CSpan-DPe
adds noise only after calculating the frequency values of closed
sequences, while T-CSpan-DP adds noise during the calculation. The
number of noise additions in T-CSpan-DP is proportional to
the number of frequent items, which is very large. Moreover, the
experimental results also confirm that T-CSpan-DPe is
robust, as it is almost unaffected by the parameters, so it can provide
high mining utility with a minimum privacy budget.


6 CONCLUSIONS AND FUTURE WORK 


In this paper, we proposed two algorithms for SPM considering
time intervals in EMR systems while satisfying differential privacy.
To maintain high utility, noise was added only to the candidate
closed sequences. From the perspective of utility and amount of
added noise, T-CSpan-DPe is a better choice than T-CSpan-DP
because it obtains better utility while keeping the amount of noise
smaller, as noise is added only at the final step of deciding which
sequences are frequent patterns.

The experimental results with a real medical dataset
verify that T-CSpan-DPe is a robust algorithm, as its high
utility was not affected by the parameters. As a result, the
proposed algorithm has high potential to be applied in practice, since
it is able to achieve high utility with a minimum privacy budget and
noise.

In the future, we would like to evaluate the algorithm with other
datasets. Moreover, we plan to improve the algorithm by better
controlling the amount of added noise while maintaining high
utility, for example by studying mechanisms other than the Laplace
mechanism. We would also like to extend our work to fields
other than EMR data, such as medical insurance claim data, retail data
and traveler record data.


ABSTRACT


In this work we face the challenge of estimating a ship’s main- 
engine rotational speed from vessel data series, in the context of 
sea vessel route optimization. To this end, we study the value of 
different vessel data types as predictors of the engine rotational 
speed. As a result, we utilize speed data under a time-series view 
and examine how extracting locally-aware prediction models affects 
the learning performance. We apply two different approaches: the 
first utilizes clustering as a pre-processing step to the creation of 
many local models; the second builds upon splines to predict the 
target value. Given the above, we show that clustering can improve 
performance and demonstrate how the number of clusters affects 
the outcome. We also show that splines perform in a promising 
manner, but do not clearly outperform other methods. On the other 
hand, we show that spline regression combined with a Delaunay 
partitioning offers the most competitive results.


1 INTRODUCTION


Liner shipping companies can benefit significantly by improving 
ship scheduling and cost analysis in service route planning us- 
ing computational methods. Furthermore, since there is a strong 
demand for ships to reduce their emissions, a number of current 
research activities focus on estimating shipping emissions and de- 
veloping mitigating solutions to tackle the problem (e.g. [30]). In 
addition, volatility in fuel prices constitutes a major problem for 
shipping companies, as fuel accounts for 60% of the overall ship
operating cost [6]. As a result, modern ship management moves to- 
wards energy-efficient procedures and operations, aiming to reduce 
energy consumption for lowering management costs and thereby 
maintaining a competitive position in the market while reducing 
the corresponding environmental impact. 

Routing optimization has been a major problem in the shipping
industry for over three decades and remains one of the research
topics of primary interest for the maritime community. Especially
nowadays, when new technologies and concepts, such as Big
Data, Data Mining and Pattern Recognition, and new methods of data
acquisition (AIS data) are overthrowing traditional ways of scientific
exploration, data-driven maritime research is gaining attention.
The automatic identification system (AIS) is an automatic tracking
and self-reporting system for identifying and locating vessels by
electronically exchanging data with other nearby ships, AIS base
stations and satellites. The widespread use of AIS has enabled vessel
tracking and increased the availability of ship trajectory data.

The problem of optimal route planning takes into consideration
the objectives of ship owners regarding energy consumption and
on-time delivery of goods, and the restrictions set by the regulatory
framework (national regulations, IMO etc.). Regardless of the specific
constraints, what makes the optimal-route problem so challenging
is the time-varying character of weather conditions during the
voyage of the vessel. In this work the optimal-route problem is
mainly examined under the aspect of Fuel Oil Consumption (FOC),
and an optimal route is one that minimizes the vessel's FOC for a
given destination.

As is well known from the ship-powering literature, FOC is closely
related to the rotational speed (measured in revolutions per
minute, RPM) of the main engine. Consequently, the optimal-route
problem could be significantly simplified if a good predictive
model for RPM were available. To elaborate further along this path, this
article summarizes the current status of our work on coupling a ship's
velocity V with the main engine's RPM in the context of a non-convex
regularized regression estimation problem, in conjunction with
the fact that, as marine engineering points out, there is a strong
correlation between these two factors. Coupling V with RPM will
give the ship operator a tool that does not impose installation
requirements on the ship, such as sensors for gathering data.
Instead, it can be readily used by only obtaining the position of the
ship from satellites, calculating its speed and getting the weather
conditions at each time interval. At the same time, it is a first step
towards integrating input from more sources (including weather
and sea condition data) and allowing the creation of data-driven
models (black box models) that are able to predict and optimize
vessel consumption.

The rest of the paper is organized as follows: Section 2 provides and
analyzes a summary of the literature related to the routing optimization
problem in the maritime industry. Section 3 presents
the motivation behind the work of this paper and summarizes our 
initial exploratory experiments. Section 4 describes a formulation 
of the problem at hand and gives an overview of the proposed al- 
gorithm. Section 5 depicts and interprets the experimental results, 
combined with statistical testing. Finally, Section 7 provides the 
main conclusions of this work and outlines the next steps. 


2 RELATED WORK 


Following strong regulatory and societal demand for ships to reduce
their emissions, current research activities focus on estimating
global shipping emissions and developing mitigating solutions to tackle
the problem, e.g. [30]. In addition, the increase and volatility in fuel
prices constitute a major problem for shipping companies, as fuel
contributes approximately 60% to the overall ship operating cost
[6]. As a result, shipping companies are moving towards adopting
energy-efficient procedures and operations to reduce energy
consumption and thereby maintain their competitive position in
the market as well as reduce the environmental impact. There is
a plethora of theoretical papers related to ship route optimization,
starting as early as 1960 [31] and evolving from simple concepts,
such as the so-called isochrone and isopone methods [7], to
more elaborate and rigorous approaches, such as optimal control
[29], dynamic programming [24], graph theory [25] and evolutionary
algorithms [26].

Numerous studies in different disciplines have been undertaken
to predict fuel consumption using artificial neural network (ANN)
models [27]. ANNs have been applied successfully to many prediction
tasks, such as modelling and prediction of energy-engineering
systems [22], prediction of the energy consumption of
passive solar buildings [10], energy system development and
forecasting of energy consumption [1], and analysis of emissions reduction
[20]. There are also some relevant reports of ANNs being used for
implementing decision-support systems in various subjects, such
as solving the buffer allocation problem in reliable production [28],
developing environmental emergency decision-support systems
[24], risk assessment on prediction of terrorism insurgency [11]
and simulation metamodelling [2]. ANNs have also been used
to predict the specific fuel consumption and exhaust temperature of a
Diesel engine for various injection timings [21].

The optimization objectives in the ship routing problem are usually
the minimization of the voyage time, fuel consumption and
voyage risk. The approaches that have appeared so far in the
pertinent literature can be classified into two large categories: (a)
Vessel-based optimization, where we optimize a given route with
respect to vessel characteristics, e.g., vessel speed, main-engine
rotational speed, trim, roll, heave and pitch motions, and (b)
Condition-based optimization, where we optimize a given route by taking into
account environmental data, e.g., wind (speed, direction), waves
(height, frequency, direction), currents, etc. The aforementioned
methods utilize techniques that can be separated into three main
categories: (a) Analytical approaches that try to tackle the problem
with the use of exact (NP-complete) and/or heuristic algorithms, such as
label-setting algorithms, non-linear integer programming and
simulated annealing [15]; (b) Data-oriented approaches that combine
vessel-trajectory data, gathered from sensors or satellites (AIS data),
with Machine- and Deep-Learning algorithms [23]; (c) Approaches
where ML (machine learning) methods, e.g., Box Models: White,
Black and Grey Box Models (WBM, BBM, GBM), are combined
with analytical methods, e.g., the equations of motion of a freely
floating body moving with constant forward speed (WBM), in order
to increase the accuracy of a regression method in an ML model
(BBM) [3].

Finally, methods that refine the voyage grid (map) in areas of
critical interest involving, e.g., weather conditions, emission control
areas (ECA, SECA: sulfur-oriented ECAs), high-risk zones (piracy),
and choose from a set of optimal routes the best in terms of FOC
and safety (Pareto-optimal solutions, Genetic Algorithms) [12]
must also be mentioned.


3 MOTIVATION 


The motivation for the current work came directly from a business
need for the optimisation of the ship engine usage (RPM) in relation
to FOC. Based on this requirement, we first performed an
exploratory analysis on a real dataset in order to understand the
nature of the FOC-RPM relation. Given the rich and composite
feature set of the FOC prediction problem, before training any
multi-parameter prediction model it is important to study the effect of
each parameter separately. The exploratory analysis was performed
on a dataset comprising 10° observations from multiple ships and
allowed us to determine which feature is most appropriate for the
prediction of FOC (Fuel Oil Consumption). Initial experiments on the
complete dataset were performed using feature selection algorithms
in order to rank the features by importance. Random Forest (RF)
regression was used for selecting the most informative features.
The eight top-ranked features on the basis of RF regression are
listed in Table 1.


Feature importance

Feature Name          Importance  Description
RPM                   0.98353     Main engine revolutions per minute.
STW                   0.00365     Speed through water.
Speed Overground      0.00266     Speed of the ship with respect to the ground.
Apparent wind speed   0.00133     The relative speed, i.e., the speed experienced by an observer or a measuring instrument on the ship.
Port mid draft        0.00075     Draft amidships on the port side of the ship; port is the left-hand side of a vessel facing forward.
STBD mid draft        0.00042     Draft amidships on the starboard side of the ship; starboard is the right-hand side, facing forward.
Mid draft             0.00075     Draft amidships.
Apparent wind angle   0.0007      The relative angle, i.e., the angle experienced by an observer or a measuring instrument on the ship.


Table 1: The top ranked features by importance, using Random
Forest regression.


It is clear from Table 1 that RPM plays a pivotal role in the
prediction of FOC. Based on this finding, it seems reasonable to
develop a predictive model for FOC using RPM only, since it has
maximum importance and is much easier to measure than other
features (e.g. wind speed or draft). Measuring the correlation
between RPM and each of the remaining seven features of Table 1,
using the PPMCC (Pearson Product Moment Correlation Coefficient),
showed an extremely high linear relation (0.92) between RPM and
speed over ground, a result that is also aligned with Figure 1, which
confirms a strong linear relationship between the two variables.

Figure 1: A sample plot of Main-Engine's rotational speed
(RPM) and observed speed during a vessel's route (courtesy
of DANAOS Shipping Co.)

Figure 2: The correlogram of RPM and V during a vessel's
route


A survey of the pertinent literature on Naval Architecture and
Marine Engineering shows that there is no robust, low-complexity,
analytical relation between RPM and V. On the other hand, significant
work has been done on complex, time-consuming methods
that perform well while taking into account various related factors,
such as geometric and hydrodynamic ones [17]. Thus, our effort
to find a way to efficiently predict RPM from V using data-driven
methods is well justified. From ship hydrodynamics
it is well known that the power delivered to the propeller is
proportional to the product RPM · Q, where Q is the torque absorbed by the
propeller of the ship. Then, recalling standard resistance and propulsion
theory of ships, we can say that, for a given ship, the torque
Q is a function depending exclusively on the ratio V/RPM and, as
a result, predicting RPM from V is a decisive step for predicting
power and thus optimising fuel cost.

This claim is also strengthened by the commercial
potential of such a model, as velocity V is a feature that
can be easily measured, even remotely from a satellite, and does
not require further installations (e.g. sensors) on board.

In order to further study what happens before and after velocity 
changes, we plot the correlation coefficient for each lag variable 
(observations at previous time steps). This gives a quick idea of 
which lag variables may be good candidates for building a predictive 
model and how the relationship between the observations and their 
historic values changes over time. 
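The lag analysis described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two series are equally spaced and of equal length, the Pearson coefficient is used as the correlation measure, and the function names `pearson` and `lagged_correlation` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # undefined (division by zero) for constant series


def lagged_correlation(v, rpm, max_lag):
    """Correlation between rpm[t] and v[t - h] for every lag h = 0..max_lag."""
    return {
        h: pearson(v[: len(v) - h], rpm[h:])
        for h in range(max_lag + 1)
    }
```

Lags whose coefficient stays far from zero are candidates for lag variables; plotting the returned values against h gives a correlogram of the kind shown in Figure 2.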

The correlogram is a commonly used tool for checking randomness
in a data set. In time series analysis, the correlogram, also
known as an autocorrelation plot, is a plot of the sample autocorrelations
$r_h$ versus the time lag h. The correlogram of Figure 2 presents
the lag number along the x-axis (time axis), with values varying
between -8 * 10? and 8 * 10? minutes, and the correlation coefficient
value (ranging from 0 to 1) along the y-axis. For random behavior,
the auto-correlations should be nearly zero for all time-lag separations.
In the opposite case, one or more of the auto-correlations
should be significantly different from zero. This is the case of RPM
in Figure 2, which reveals a strong correlation between RPM and V,
mainly for time steps t+1, ..., t+2·10? (m), that can be utilized to select
appropriate lag variables as extra features for our estimators.


4 PROPOSED METHOD


4.1 Problem formulation 


A formal definition of the problem of predicting RPM based on the
monitored velocity (V) over ground can be stated as follows: Given
a vessel's speed for n consecutive periods, find a function $f(V_1, \dots, V_n) :
\mathbb{R}^n \to \mathbb{R}^1$, which estimates the engine's RPM at moment $t_{n+1}$.

If we assume that the relationship between RPM and V is a par- 
tially linear function with non-linear segments over time, then it is 
possible to describe this specific problem as a linear mixed-effects 
model (LMM) [18]. A mixed model is a statistical model incorpo- 
rating both fixed and random effects. A random-effects model is 
a kind of hierarchical linear model, which assumes that the data 
being analysed are drawn from a hierarchy of different populations, 
whose differences relate to that hierarchy. These models are useful 
in a wide variety of applications in physical, biological and social 
sciences. They are particularly useful in settings where repeated 
measurements are made on the same statistical units (longitudinal 
study), or where measurements are made on clusters of related 
statistical units. 

The linear mixed-effects model (LMM) is well suited to
regression on clustered data and to exploring the
heterogeneity of effects within and between groups of similar
values [5]. The connection between RPM and V fits the LMM
setting well, since in most cases there exists some degree of correlation
between the two features, which implies a linear dependency. However,
at specific moments very similar V values correspond to different
values of RPM, inducing a non-linear dependency. Analytically, an
LMM can be described as:


$$y = T d + u + e,$$ (1)


where y is a vector containing the previously observed values of
the feature we want to predict, T is a known matrix that relates
the observations y to the unknown fixed-effect vector d, u is the
unknown vector of random effects and, finally, e is the
unknown vector of random errors. Both u and e follow zero-mean
normal distributions with cov(u, e) = 0.

A way to combine RPM and V through time (using discrete time
slots) is to assume that the following model holds true:


$$y_i = f(t_i) + e_i, \quad i = 1, \dots, n, \quad t_i \in [0, T],$$
$$y_i := RPM_i, \quad f(t_i) := g(V(t_i)),$$ (2)


where $RPM_i$ is the ME rotational speed measured at time $t_i$, $V(t_i)$ is
the ship's speed measured at time $t_i$ and $g(V)$ is the sought-for
underlying function that, when composed with the known function
$V(t)$, gives, for $t = t_i$, the corresponding $RPM_i$ with error $e_i$. For
continuous time t, the above equation can be written as


$$y(t) = f(t) + e(t), \quad t \in [0, T],$$
$$y(t) := RPM(t), \quad f(t) := g(V(t)) = (g \circ V)(t).$$ (3)

Since the measurement of V and RPM usually results in many
noisy observations, a function learned from data can have the form
of a smoothing spline that balances between goodness and smoothness
of fit. A smoothing spline $f(t)$, $t \in [0, 1]$, in the Sobolev space
$H^m$, consisting of $L^2$ functions whose weak derivatives of order
up to $m$ belong to $L^2$ as well, is a solution of the following
minimisation problem


$$\min_{f \in H^m} \left\{ \frac{1}{n} (y - f)^T W (y - f) + \lambda \int_0^1 \left( f^{(m)}(t) \right)^2 dt \right\},$$ (4)


where $y = (y_1, \dots, y_n)^T$, $f = (f(t_1), \dots, f(t_n))^T$ and $W$ is a given
positive definite matrix accounting for the correlation between the
components of the error vector e. The parameter $\lambda$ controls the
trade-off between fidelity-to-the-data and smoothness of fit and is
often referred to as the smoothing parameter. In [13, 14] it is shown
that the solution of (4) can be expressed as:


$$f(t) = \sum_{\nu=1}^{m} d_\nu \phi_\nu(t) + \sum_{i=1}^{n} c_i R^1(t, t_i),$$ (5)


where $\phi_\nu(t) = t^{\nu-1}/(\nu-1)!$, $\nu = 1, \dots, m$, is a set of polynomials and
$R^1(s, t) = \int_0^1 (s-u)_+^{m-1} (t-u)_+^{m-1} du / ((m-1)!)^2$, with $x_+ = x$ if $x \ge 0$
and $x_+ = 0$ otherwise, is a polynomial spline of degree
$2m - 1$, yielding the well-known cubic spline for $m = 2$. Denoting
$T = \{\phi_\nu(t_i)\}_{i=1,\nu=1}^{n,m}$ and $\Sigma = \{R^1(t_i, t_j)\}_{i,j=1}^{n}$, one can prove that


$$(f(t_1), \dots, f(t_n))^T = T d + u, \quad u = \Sigma c,$$ (6)


where $c = (c_1, \dots, c_n)^T$ and $d = (d_1, \dots, d_m)^T$ are solutions to the
so-called Henderson's Mixed Model Equation (MME) [8]. Finding
the spline estimator for the Linear Mixed-Effects Model (LMM)
described in Equation (1) can be done using the Best Linear Unbiased
Estimators (BLUE) method [9]. As a result, using spline estimators
as a model for revealing the underlying relationship between V and
RPM seems to be rational and well grounded.
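As a worked instance of the formulas above, consider the cubic case $m = 2$. The derivation below is a short sketch that uses only the definitions of $\phi_\nu$ and $R^1$ given in the text and assumes $0 \le s \le t \le 1$ (the case $t < s$ follows by symmetry of $R^1$):

```latex
\phi_1(t) = 1, \qquad \phi_2(t) = t,
\qquad
R^1(s,t) = \int_0^1 (s-u)_+ \,(t-u)_+ \, du
         = \int_0^{s} (s-u)(t-u) \, du
         = \frac{s^2 t}{2} - \frac{s^3}{6}, \qquad s \le t .
```

Substituting into (5) gives $f(t) = d_1 + d_2 t + \sum_{i=1}^{n} c_i R^1(t, t_i)$, which is piecewise cubic in $t$, in agreement with the degree $2m - 1 = 3$ stated above.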


4.2 Partitioning the input space 


In order to take advantage of the ability of splines to fit LMM
problems and to adapt to the local (temporal) nature of the correlation
between RPM and velocity, we build on this modeling theory. For
this purpose, we introduce splines as a way of achieving higher
accuracy in RPM prediction based on velocity history, and we apply
clustering algorithms to the vessel's trajectory data in order to find
sub-trajectories that share similar velocity values.

For predicting RPM values from velocity (V) measurements using
splines, we suggest clustering the training space into regions with
similar velocity patterns. Regions must have similar N previous
values of V, on the basis that including the velocity at N previous
time steps as an extra feature in the training phase will lead to
higher accuracy when predicting RPM for time $t = t_{i+1}$, as shown
in Section 3. Therefore, each cluster will represent subsets of similar
distributions, in terms of standard deviation and mean value, with
respect to the history of a value $V_i$ at a given time $t_i$ during a route.
All velocity instances V that have similar N previous values are
grouped in the same cluster in order to build and train different
models that represent different distributions. The training data
consist of 2D vectors of the form $(V(t_i), V_N(t_i))^T$, with $V_N(t_i) =
mean[V(t_{i-N}), \dots, V(t_i)]$, and the corresponding $RPM(t_i)$ value.
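The construction of the two-dimensional training vectors can be sketched as follows. This is an illustration only: it assumes the velocity series is a plain Python list sampled at equal time steps, and it skips the first N instants, for which a full history window is not available (the text does not say how those are handled). The name `build_features` is hypothetical.

```python
def build_features(v, n_prev):
    """Return [(V(t_i), V_N(t_i))] for every i that has n_prev previous values.

    V_N(t_i) is the mean of V(t_{i-N}), ..., V(t_i), i.e. N + 1 values.
    """
    feats = []
    for i in range(n_prev, len(v)):
        window = v[i - n_prev : i + 1]
        feats.append((v[i], sum(window) / len(window)))
    return feats
```

For example, with v = [10, 12, 14, 16] and N = 2, the first feature vector is (14, 12.0), since the mean of 10, 12 and 14 is 12.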

At the final stage of the evaluation, the problem becomes a
classification problem. Each instance $V(t_i)$ of the test (unseen)
dataset must be assigned to the most similar cluster$_{k(i)}$. From this
point on, we predict the corresponding $RPM(t_i)$ value with the
specific model$_{k(i)}$ trained on this particular group of data.

The idea behind the proposed piece-wise regression by clustering
is that, as previously stated, the relationship between RPM and V
is not linear at all times. As Figures 1 and 2 indicate, there exists
a strong linear relationship between the dependent (RPM) and
independent (V) variables; nevertheless, there are parts exhibiting a
higher-order (non-linear) correlation. This remark guides us to the
choice of building different models, each one corresponding to a
different part of the relation between our variables (RPM, V), in order
to improve the overall accuracy.

Given a sample dataset comprising tuples of the form $\{(x_1, y_1), \dots,
(x_m, y_m)\}$, where $x_i := V(t_i)$ and $y_i := RPM(t_i)$, we assume that
the relation between V and RPM is described by a polynomial
regression model of the form:


$$F(x_i) = b_0 + b_1 x_i + b_2 x_i^2 + \dots + b_n x_i^n + e_i, \quad i = 1, \dots, m,$$


or in matrix form:

$$F = A(x_1, \dots, x_m)\, b + e,$$ (7)


where $b = (b_0, b_1, \dots, b_n)^T$ is the parameter vector, A is a Vandermonde
matrix, also referred to as the design matrix, and e is the error
vector. The problem of building K regression models in K different
clusters $X_1, X_2, \dots, X_K$ can be formalized with the aid of (7) as
follows:


$$F(x) = \begin{cases}
A_1(x_{11}, \dots, x_{1m_1})\, b_1, & x_1 = (x_{11}, \dots, x_{1m_1})^T \in X_1, \\
\quad \vdots \\
A_K(x_{K1}, \dots, x_{Km_K})\, b_K, & x_K = (x_{K1}, \dots, x_{Km_K})^T \in X_K.
\end{cases}$$ (8)


When choosing K one must take into account the trade-off between 
fitting the data and avoiding model complexity and overfitting, 
which may result in poor generalization on unseen data. This is re- 
lated to one of the most crucial aspects in function learning, known 
as the trade-off between bias and variance. The value of K can 
be chosen through cross-validation, with a possible upper-bound 
dictated by the maximum tolerable complexity of the estimated 
model. Clustering the data-set and choosing the optimal K plays a 
crucial role in piece-wise regression analysis as the results of our 
experiments (see Section 5) point out. 

A draft sketch of the algorithm that performs regression on different
sections (clusters) of velocity values to predict RPM follows.

The algorithm begins with the set D of (velocity, RPM) pairs,
which is clustered into k clusters in a way that optimizes the
bias-vs-variance trade-off. Then each instance $V(t_i)$ is assigned to the
"best" cluster $D_j$ in terms of fitness, using the normalized distance
$d_{ij}$ from the centroid $C_j$ of each cluster. The model that has been
trained on the cluster with minimum distance is used to predict
the corresponding $RPM(t_i)$ value.
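The procedure described above can be sketched in plain Python. This is a simplified illustration and not the authors' implementation: it assumes plain k-means with squared Euclidean distance (instead of the normalized distance mentioned in the text) and fits a straight line RPM = a · V + b per cluster as a stand-in for the polynomial or spline models. All function names are hypothetical.

```python
import random


def nearest(p, centroids):
    """Index of the centroid closest to point p (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
    )


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd iterations; an empty cluster keeps its previous centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids


def fit_line(xs, ys):
    """Least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx


def train(features, rpm, k):
    """Cluster the (V, V_N) features and fit one model per cluster."""
    centroids = kmeans(features, k)
    groups = [([], []) for _ in range(k)]
    for f, r in zip(features, rpm):
        j = nearest(f, centroids)
        groups[j][0].append(f[0])  # regress RPM on the current velocity only
        groups[j][1].append(r)
    models = [fit_line(xs, ys) if xs else (0.0, 0.0) for xs, ys in groups]
    return centroids, models


def predict(f, centroids, models):
    """Assign f = (V, V_N) to the closest cluster and apply that cluster's model."""
    slope, intercept = models[nearest(f, centroids)]
    return slope * f[0] + intercept
```

In this sketch the number of clusters k is passed in by the caller; in line with the text, it would normally be selected by cross-validation.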


5 EXPERIMENTAL EVALUATION 


The aim of the experimental evaluation process is to test the
applicability and the performance of the proposed methodology in
predicting RPM from previous observations of the velocity (V).
More specifically, the questions we examine in this experimental
section are the following: (a) Does the clustering/partitioning of
the input space, when combined with models trained separately for
each cluster, affect the prediction performance? (b) Given a dataset
containing (RPM, V) observations, is there an optimal number of
clusters that maximizes prediction performance? (c) How does the
number of clusters relate to the expected performance? (d) Does
spline regression perform better than other established baselines?
(e) How does the combination of spline regression and clustering
of the input space perform?


Algorithm 1 Piecewise regression algorithm with clustered data

Require: D = {(v_1, r_1), ..., (v_m, r_m)}: v_i = V(t_i), r_i = RPM(t_i)
Split D into k clusters D_1, ..., D_k
foreach D_j, j ∈ [1, k] do
    M_j = train regression model(D_j)
end
foreach V(t_i) do
    V_N(t_i) = mean[V(t_{i-N}), ..., V(t_i)]
    foreach D_j, j ∈ [1, k] do
        C_j = centroid(D_j)
        d_ij = normalized distance of (V(t_i), V_N(t_i)) from C_j
    end
    D_l = arg min_j d_ij
    RPM(t_i) = M_l(V(t_i))
end

In order to preserve the statistical independence of our results
between different datasets, in all the experiments that follow we apply
the two-sample Kolmogorov-Smirnov (K-S) test. The K-S test is a
non-parametric test of the equality of continuous (or discontinuous),
one-dimensional probability distributions that can be used to compare
a sample with a reference probability distribution (one-sample test)
or to compare two samples (two-sample test). The
sizes of the train and test subsets for the experiments presented below are
set to approximately 4 * 10? and 3 * 10? observations, respectively.
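For illustration, the two-sample K-S statistic can be computed directly from the empirical distribution functions of the two samples. The sketch below is a minimal pure-Python version under the assumption that only the statistic (not the p-value) is needed; the name `ks_statistic` is hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:  # the supremum is attained at one of the observed values
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

A value close to 0 indicates that the two samples have similar empirical distributions, while a value of 1 means that they do not overlap at all. In practice a library routine such as `scipy.stats.ks_2samp` also returns the p-value of the test.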


5.1 Regression methods 


Apart from Spline Regression (Section 4.1), in our experiments we
evaluated three more regression techniques, namely Linear Regression,
Random Forest Regression and Neural Networks, as follows.

Linear regression is a classic regression technique, which models
the output variables as linear combinations of the input variables. 
The regression coefficients of the input variables are usually esti- 
mated using least-squares error or least absolute-error approaches 
and the optimization problem is solved efficiently using either 
quadratic-programming or linear-programming. In order to ac- 
commodate non-linearity, when it exists, polynomial regression is 
an alternative to linear regression analysis. 

Random-Forest regression is an ensemble technique used for 
classification and regression. It starts with constructing a set of 
decision trees at training time and then outputs the majority output 
value (in classification tasks) or the mean output value (in regression 
tasks) of all individual trees. The randomness principle is either 
covered by choosing a random subset of features or by choosing a 
random subset of observations to train each individual tree.

Neural Networks are another popular technique for regression and
classification tasks. Using Python's Keras framework, we defined a
Neural Network with one input layer, four hidden layers, each
consisting of 10 neurons, and one output layer. We used the rectified
linear unit (ReLU) as the activation function of each layer. ReLU is
defined as y(x) = max(0, x), and is a function that, in contrast to
other activation functions, back-propagates a larger percentage of
the error on the output to update the neuron weights. A stochastic
gradient descent process, the AdaGrad optimizer of the Keras
framework, was used to find the optimal set of weights for the
neural network. Each optimization ran for 10 epochs (full training
cycles on the training set).


5.2 Clustering methods 


The clustering techniques used in our experiments are K-means and
a triangulation-based clustering algorithm; both are briefly
explained in the following. The techniques have been tested on
datasets of size $10^q$ (q = 3, 4, 5).

K-means clustering is a vector quantization method, with ori- 
gins from the field of signal processing, that is widely used for data 
clustering. Its main aim is to partition the observations (vectors) 
into K clusters, so that each observation belongs to the cluster 
with the nearest centroid (representative vector of the cluster). As 
a result, the data space is partitioned into Voronoi cells. 
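The assignment and update steps behind this partitioning can be sketched as follows. This is a minimal illustration of Lloyd's algorithm, not the implementation used in the experiments; taking the first k points as initial centroids and the toy point set are assumptions made only for this example.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on tuples of coordinates.

    Assumption for this sketch: the first k points serve as initial centroids.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, labels


# Hypothetical toy data: two well-separated groups of three points.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
          (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
```

Each point ends up in the Voronoi cell of its nearest centroid, which is the partitioning property mentioned above.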

Triangulation clustering (DC) [4] first partitions the training
space into triangles using a triangulation-based method. Delaunay
Triangulation (DT) was used in our experiments because it is
intrinsically related to the Voronoi diagram, being in fact its
dual graph. Another reason for preferring DT over other
triangulation techniques is its close connection with the so-called
Delaunay Configurations that, as stated in [19], are closely
related to a multivariate extension of the univariate B-Splines
used in this paper for approximation.

By selecting a cut-off value p (a value that is used to determine
the neighboring points from the adjacency list of each candidate
vector $[V(t_i), V_N(t_i)]$), we can find for each point in the
training space its neighboring vertices in the resulting graph. By
applying a Depth-First-Search (DFS) algorithm it is then possible
to find the isolated subgraph components recursively, as depicted
in Figure 4, which shows the resulting clusters for the pointset in
Fig. 3.
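The cut-off and DFS steps can be sketched as below. This is a hedged illustration only: the edge list is written by hand and stands in for the edges of a triangulation (it is not computed by a Delaunay routine), the DFS is iterative rather than recursive, and treating single-point components as outliers is an assumption of this example.

```python
import math


def cluster_by_cutoff(points, edges, cutoff):
    """Connected components of the graph that keeps only edges no longer than `cutoff`."""
    adjacency = {i: [] for i in range(len(points))}
    for i, j in edges:
        if math.dist(points[i], points[j]) <= cutoff:
            adjacency[i].append(j)
            adjacency[j].append(i)

    seen, clusters, outliers = set(), [], []
    for start in range(len(points)):
        if start in seen:
            continue
        # Iterative depth-first search from an unvisited point.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        # Assumption: a point without any short edge is reported as an outlier.
        if len(component) > 1:
            clusters.append(sorted(component))
        else:
            outliers.append(sorted(component))
    return clusters, outliers


# Hypothetical points and hand-written edges standing in for a triangulation.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (5, 30)]
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3),
         (3, 4), (3, 5), (4, 5), (2, 5)]
clusters, outliers = cluster_by_cutoff(points, edges, cutoff=2.0)
```

With this cut-off, only the short edges survive, so the two tight groups form clusters and the distant point is left isolated.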

The basic idea behind clustering with triangulation is that it
defines a cluster in a much broader manner than, e.g., K-means,
and is thus able to cluster observations in non-spherical
neighborhoods. Also, K-means, in the general definition used here,
does not seem to detect outliers. In contrast, DT-based clustering,
as depicted in Figure 4, is able to detect and remove outliers from
clusters, resulting in more "reliable" clusters. Further research
for improving this method has to focus on the search for the
optimal cut-off value p. Both clustering algorithms showed
promising performance, especially in conjunction with linear and
spline regression, respectively.


5.3 The effect of clustering 


Initial experiments were conducted for the previously described
clustering methods with a constant training size of approximately
3 * 10? instances. Specifically, we applied the algorithm proposed
in Section 4.1 with the aforementioned regression methods, for
different numbers of clusters k (for K-means clustering) and
different cut-off values p (for the triangulation-based
clustering). Indicative results are collected in Figures 5 and 6.

Table 2 summarizes the results of the experimental evaluation on
five statistically independent samples of 3 * 10? instances each,
using different combinations of clustering (K-means, Delaunay
Triangulation (DT)) and regression (linear LR, splines-based SR,
random forests RF and neural networks NN). The table reports the
variance of the input variables and the Mean Absolute Error (MAE)
of the predicted values. Results show that RF with K-means and
Splines with DT clustering have the best accuracy. However, the
optimal number of clusters varies, depending on the time instance
at which the training sample was drawn and therefore on its
distribution. The DT-based methods perform better with the splines
model (SR) than with LR, and in some cases the overall accuracy
achieved by the SR/DT combination is higher than that achieved by
LR or RF combined with any of the clustering methods. Also, the DT
clustering method produces a better space partitioning than K-means
when Spline regression is used for RPM prediction. The results of
our experiments are aligned with the theory that K-means is locally
isotropic, in contrast to DT clustering, which moves through the
search space to find neighboring points by using the weighted edges
of the Delaunay Triangulation. On the other hand, Neural Networks
do not seem to work well after clustering, as the results of
Table 2 indicate.


Experimental Results
Algorithm  Variance  Clusterer  MAE     Opt. # clusters (if ≠ 1)
SR         63.892    K-means    .595    9
SR         66.497    K-means    2.527   6
SR         64.693    K-means    .880    26
SR         63.892    DC         .405    9
SR         66.497    DC         .527    6
LR         63.892    K-means    .58     9
LR         66.497    -          2.474
LR         64.693    K-means    .761    30
LR         63.892    DC         .580    9
LR         66.497    -          2.474
LR         64.693    -          2.340
RF         63.892    K-means    .550    9
RF         66.497    K-means    2.055   5
RF         64.693    -          21.54
RF         63.892    DC         .550    9
RF         66.497    DC         2.202   5
RF         64.693    -          .880
NN         63.892    -          2.2021
NN         66.497    K-means    2.055   5
NN         64.693    -          2.141
NN         63.892    DC         2.345   9
NN         66.497    DC         9.234   5
NN         64.693    -          4.412


Table 2: Results of the experimental evaluation.


The experimental results in Figures 5 and 6 and Table 2, together
with further results for five different statistically independent
subsets of approximately 4 * 10? observations, are summarized in
Figure 7, which illustrates the mean difference (in error rate)
between clustered and non-clustered data for the different
regression methods.

Results show that clustering improves the performance of the
regression algorithms, especially for the first three (i.e. LR, RF,
SR). On the other hand, the NN algorithm performs worse when
combined with clustering, which is also evident from the last rows
of Table 2. Another outcome is that Spline regression (SR) exhibits
the largest improvement in terms of prediction error compared to
the other three regression methods, when we compare its performance
on the original and on the clustered dataset. This result must be
further examined in order to search for a connection between the
knots of the spline estimator and the clustered input values.
Finally, based on the experimental results we can conclude that for
all 3 regression algorithms there exists an optimal number of
clusters for which they achieve the highest accuracy.

Figure 3: Convex hull and Delaunay Triangulation of a planar
training pointset $[V(t_i), V_N(t_i)]$.

Figure 4: Clustering outcome after applying DFS on the DT of
Fig. 3; points belonging to the same cluster are connected with red
linear segments. Green points indicate outliers.

Figure 5: Error convergence with K-means for varying k values.

Figure 6: Error convergence with DT clustering for a varying
number of clusters.

Figure 7: Mean error rate difference of regression models
(clustered vs non-clustered samples).


5.4 Finding the optimal number of clusters 


The number of clusters of the input variables proves to be a
critical parameter that affects the accuracy of the regression
algorithms. In order to show statistically that the number of
clusters plays a significant role in the process of estimating RPM,
we perform the Kruskal-Wallis statistical test [16]. This test is a
non-parametric alternative to the one-way Analysis of Variance
(ANOVA) and is used to compare three or more groups on a dependent
variable that is measured on at least an ordinal level. A
significant result in a Kruskal-Wallis test indicates that there
are group differences, but a post-hoc procedure is needed to
determine which groups are significantly different from each other.
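For illustration, the Kruskal-Wallis H statistic can be computed from pooled ranks as in the sketch below. The error values are hypothetical (they are not the paper's measurements), no tie correction is applied, and the closed-form p-value holds only for exactly three groups, where the chi-square approximation has 2 degrees of freedom.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """H statistic (without tie correction) for a list of groups."""
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical MAE values grouped by three different numbers of clusters.
errors_by_group = [[1.6, 1.5, 1.7], [2.1, 2.3, 2.0], [2.9, 2.8, 3.1]]
h = kruskal_wallis_h(errors_by_group)

# Chi-square survival function for 2 degrees of freedom (three groups only).
p_value = math.exp(-h / 2)
```

In this toy case the groups do not overlap at all, so H is large and the approximate p-value falls below 0.05.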

The Kruskal-Wallis test in our case examines the statistical
significance of differences between groups of initial parameters,
and more specifically between: i) the number of clusters, ii) the
clustering method, and iii) the regression method. The null
hypothesis tested is that there are no significant differences
between our groups of features that affect the error rate.

The results indicate that we cannot reject the null hypothesis for
the regression methods (p-value > 0.05), while we can reject it for
the rest of the groups (clustering method, number of clusters),
because their p-value is smaller than the predefined threshold
(0.05). As a consequence, we can claim that the clustering method
and the number of clusters have a significant impact on the error
rate.

The above results justify the initial idea of applying piece-wise
regression on clusters of input values and are indicative of an
underlying strong relationship between clustering and regression
analysis that must be further examined. They also show that future
work must examine many more hyper-parameters and their impact on
RPM estimation, for example the number N of previous time steps
involved in building the training vector $(V(t_i), V_N(t_i))$, the
variance of the training set, the order and smoothness of the basis
functions used in the adopted splines, etc.


5.5 Splines Regression vs other regression 
methods 


The experiments so far showed that clustering of the input space
results in higher accuracy for at least 3 out of the 4 regression
models considered. The aim of the next experiment is to test
whether the combination of Linear Mixed Models and Spline
Regression, as presented in Section 4.1, holds in practice, i.e.,
to test whether, under some constraints, splines perform better
than the other two regression methods. As a first step in this
direction, Table 3 presents the results for the optimal number of
clusters across 10 statistically independent subsets. These results
are associated with the red-dotted line of Figure 7; however, they
provide more detailed information.

The results of Table 3 are depicted in the boxplot of Figure 8,
where the performance of each regression method is compared in
terms of absolute value and standard deviation from the median. We
observe in this plot that Splines (SR) and Random Forest (RF)
perform better than Linear Regression (LR) both in terms of
accuracy and variance. As far as the comparison between SR and RF
is concerned, while SR appears to perform better than RF, the
latter exhibits lower variance in error rate than Splines.

sample   SR           LR           RF
1        1.595 / 19   1.588 / 19   1.557 / 19
2        2.027 / 6    2.794 / 2    2.255 / 5
3        2.075 / 2    2.903 / 2    2.589 / 3
4        1.889 / 26   1.761 / 30   1.831 / 39
5        2.013 / 31   2.696 / 2    2.381 / 4
6        2.034 / 27   1.5844 / 17  1.667 / 27
7        1.4056 / 23  2.08 / 41    1.574 / 23
8        1.411 / 22   1.8843 / 52  1.623 / 17
9        1.436 / 28   1.588 / 27   1.856 / 44
10       1.573 / 44   1.582 / 57   1.937 / 21


Table 3: MAE / optimal # of clusters of the three top-performing
regression methods, for the optimal number of clusters being > 1.

Figure 8: Boxplot of the regression methods compared in terms of
error rate.


In order to determine which regression method performs better, we
apply statistical testing to the results of Table 3. Because of our
relatively small sample size (10 samples < 30), we do not assume
normality of our dependent variable (the measured error), so we
conduct a Wilcoxon signed-rank test, which is the non-parametric
equivalent of the paired t-test. Both tests are used extensively to
compare two groups of dependent (i.e., paired) quantitative data.
The Wilcoxon test can be used to determine whether one algorithm
differs significantly from another in terms of accuracy. The null
hypothesis tested here is that the true mean error difference
between the two regression methods evaluated each time is greater
than zero.
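The statistic behind the signed-rank test can be sketched as follows. This minimal example only computes the rank sums of positive and negative paired differences; it does not derive a p-value, and the paired error values are hypothetical rather than those of Table 3.

```python
def signed_rank_sums(a, b):
    """Return (w_plus, w_minus) for paired samples a and b.

    Zero differences are dropped; tied absolute differences share average ranks.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]])):
            end += 1
        for idx in order[pos:end + 1]:
            ranks[idx] = (pos + end) / 2 + 1
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical paired errors of two regression methods on five samples.
method_a = [1.0, 2.0, 3.0, 4.0, 5.0]
method_b = [1.5, 1.0, 5.0, 1.0, 9.0]
w_plus, w_minus = signed_rank_sums(method_a, method_b)
```

The two sums always add up to n(n+1)/2 for n non-zero differences; the smaller of the two is what is usually compared against the critical value for the chosen significance level.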

The results from the three separate Wilcoxon signed-rank tests
indicate that: a) Comparing RF with LR we get a p-value of
0.18 > 0.05, meaning that we cannot reject the null hypothesis
stating that the true mean error difference between the pair tested
is greater than zero, from which we conclude that LR performs
better than RF. b) Comparing SR with LR we get a p-value ≈ 0.03,
which is less than 0.05. Thus, we can reject the null hypothesis.
This means that, at a confidence level of around 95%, SR performs
better than LR. c) The same can be stated for SR and RF, as the
p-value ≈ 0.04 is close to but below the predefined threshold of
0.05, indicating that SR performs significantly better than RF.

From the statistical tests conducted above, we can conclude that,
while it is safe to assume that Splines perform better, with
statistical significance, than the other regression methods, the
overall performance of the regressors (except NN) combined with
clustering was relatively good. In light of these findings, one of
the next steps of our work will be to study Spline approximation
theory further, and how clustering affects its accuracy, but also
to conduct larger-scale experiments with RF and LR.


5.6 Combining splines with clustering 


Our experiments so far showed that a strong connection exists
between partitioning the input space and the performance of spline
regression. Fig. 9 attempts to validate this statement by depicting
the mean error difference between the two clustering techniques
(K-means and DT-based clustering) for the optimal number of
clusters over 5 statistically independent samples, each consisting
of approximately 4 * 10? observations.


[Figure 9 plot. Legend (partitioners): DelaunayTriPartitioner,
KMeansPartitioner. Axis categories: LR, RF, SR, NN.]

Figure 9: Plot of mean error difference between the two
clustering methods for the optimal number of clusters.


More specifically, Figure 9 shows that, while both clustering
techniques perform well, DT clustering seems to perform better
when it is combined with 3 of the 4 regressors (all except neural
networks). Looking at the plot we can also state that spline
regression combined with DT clustering showed, marginally, the
largest improvement in terms of accuracy compared to the other two
regression techniques. This experimental result agrees with the
pertinent literature, which states a connection between Delaunay
partitions and polynomial splines; see Section 4.


6 CONCLUSIONS AND NEXT STEPS 


The motivating problem of our work is the optimal routing of a
vessel by minimizing its FOC (Fuel-Oil Consumption). By reviewing
the pertinent literature on the subject and conducting initial
experimentation we concluded that the problem could be handled
efficiently if a good predictive model for the RPM (revolutions per
minute, i.e. the rotational speed) of the main engine of a vessel
moving with known speed V were available. Furthermore, access to
real industrial data, taken from measurements on board ships,
indicated a strong correlation between RPM and V at specific time
instances during a voyage, while other instances suggested a
non-linear relationship between them.

On the basis of the above, we have been led to the idea of
developing an RPM predictive model that separates the domain into
correlated subdomains with respect to velocity V. In this
connection, we opted for Spline regression (SR) in order to
approximate the underlying function RPM(V) on each subdomain, as
splines are by their nature continuous piecewise polynomials
appropriate for approximation in partitioned domains. The set of
regressors also included Linear regression (LR), Random Forest (RF)
and a baseline Neural Network (NN).

Summarising the results of the approach assembled so far, it
seems that spline (SR) and Random Forest (RF) regression, together
with partitioning the input space with either the K-means or the
DT-based (Delaunay Triangulation) algorithm, perform better and
tend to achieve higher accuracy compared to LR and NN. It is also
worth noticing that by enhancing our feature vector with the mean
of velocity at N previous time steps we managed to improve further
the accuracy of our predictive scheme.

Besides expanding the scale and variability of our experiments,
our short-term future objectives will focus on investigating the
effect of several hyper-parameters related to clustering and the
Spline regression model, such as: (i) the optimal cut-off value for
the DT (triangulation) clustering algorithm and, generally, the
optimal number of clusters for either of the two proposed
clustering methods, (ii) the distance metric used, since in this
paper only the Euclidean distance has been tested, (iii) the number
and the exact placement of the knots used to approximate the
underlying function on each partition and (iv) the order of the
spline estimator used to interpolate the data on each partition.

Further research is also needed on the tuning of the
hyper-parameter N that controls the number of previous time steps.
Finally, another issue that can be investigated is the practical
utilization of the time instance $t_i$ at which samples were drawn.
A way to achieve this is to include weather conditions at time
$t_i$ in our feature space. Wind speed and direction, swell, wave
height etc. are some of the parameters that can easily be fed to
our existing models, or to larger-scale models yet to be built,
such as NNs, that incorporate the existing setting of our proposed
algorithm, in order to achieve higher accuracy.


Abstract


Although Bitcoin is a relatively new subject in Economics,
contributions on this topic are growing very fast. Several
papers have found evidence of bubble behaviour in exchange rates
between Bitcoin and traditional currencies. In this paper we
explore and validate this conjecture, proving also that the
bubble effect is due to confidence in future Bitcoin values.
This means that the Bitcoin price/exchange rate is influenced
by both future and past events, but that the bubble behaviour
is strictly connected to trust in the future of the Bitcoin
system.

works → Network performance evaluation.


Keywords Bitcoin, Causal-Noncausal Autoregressive mod- 
els 


ACM Reference Format:

Stefano Bistarelli, Gianna Figá Talamanca, Francesco Lucarini, and Ivan
Mercanti. 2019. Studying forward looking bubbles in Bitcoin/USD
exchange rates. In 23rd International Database Engineering &
Applications Symposium (IDEAS'19), June 10-12, 2019, Athens, Greece.
ACM, New York, NY, USA, 9 pages.
https://doi.org/10.1145/3331076.3331106


1 Introduction 


Bitcoin is a relatively new subject in Economics and Finance;
however, this digital currency is fostering a lot of studies,
and contributions on this topic are growing very fast. Some of
the studies aim at understanding the reasons for special
activities in the market. In particular, several papers have
found evidence of bubble behaviour in exchange rates between
Bitcoin (BTC henceforth) and traditional currencies (usually
Euro or Dollars) [9, 17]. The aim of this paper is to explore
the conjecture that the bubble effect is due to confidence
in future Bitcoin values, so that its price/exchange rate is
influenced both by past events and by views about future
ones.

Traditional econometrics models within the class of Au- 
toRegressive Integrated Moving Average (ARIMA) are back- 
ward looking since the only time-dependence admitted re- 
gards the past [6] and are usually referred to as causal models. 
Recently, models known as Mixed causal-noncausal AutoRe- 
gressive (MAR) have been introduced in order to extend 
time dependence to the future [7, 10, 18] thus reflecting a 
backward-forward looking behaviour. 

The paper by Gouriéroux & Hencic [15] represents a valid
anchor to refer to, at least in this area of study, as it
undertakes a noncausal analysis of the BTC/USD rates in order
to predict their future evolution. The present study shares
with [15] both the decomposition of the BTC/USD price into a
bubble and a fundamental part, and the observed time series;
here, though, the main objective is to investigate whether
confidence in future values of the BTC/USD rate (i.e. the
forward looking part) is responsible for the bubble effect,
while in [15] the focus was on forecasting future rates. If
this is the case, a significant change in the estimated
parameters should be detected when the MAR model is estimated
separately on the observed time series and on the bubble
component. In particular, the forward looking parameters should
be stronger in the bubble part than in the observed price.

The rest of the paper is structured as follows: the first part is
devoted to the economic explanation of our conjecture about
the relation of the speculative bubble in BTC/USD exchange
rates with the monetary policy of the Bitcoin system; then,
in Section 3 the theory behind the Mixed Causal-Noncausal
autoregressive models is briefly described. Section 4 describes
the dataset and Section 5 summarizes the results of the
estimation of the MAR model on the observed data. Section 6
gives conclusions and final remarks.


2 The speculative bubble in BTC/USD rates


By simply watching the trajectory of the BTC/USD exchange
rate time series it is easy to notice how often its pattern
surges and bursts rapidly, mimicking that of speculative bubbles.
The definition of speculative bubble considered in this paper
is the one proposed by Blanchard [5] in the framework of
rational expectations models, where it is assumed that the
economic variable of interest, say $x_t$, has two components:
the first one depicts the fundamental path of $x_t$, while the
second represents the bubble effect. In this context a bubble
results from the departure of $x_t$ from its fundamental path.
Fig. 1 records one of the major bubbles of the BTC/USD rate,
which occurred in 2013.

Figure 1. Bitcoin/USD observed time series


Bitcoins are produced through a “mining” process which
involves computers (nodes from now on) solving complex
mathematical problems (cryptography) to keep the system
secure; when a node finds a solution to the problem it is
rewarded with an amount of Bitcoins which is referred to as the
“Block reward”. The protocol running Bitcoin is programmed
to halve the “Block reward” every 210,000 blocks; because the
difficulty of the mathematical problems to be solved is adjusted
to keep the block rate roughly constant, this happens about
every 4 years. Hence, the volume of new coins will decay to zero
with time and the long-term monetary supply will be fixed.
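The fixed long-term supply follows from summing the halving schedule, as the short sketch below illustrates. The constants (an initial reward of 50 BTC, a halving every 210,000 blocks, and rewards counted in whole satoshis of 10^-8 BTC) are taken from the Bitcoin protocol rather than from this paper, and the function name is hypothetical.

```python
SATOSHI_PER_BTC = 100_000_000
INITIAL_REWARD_SATOSHI = 50 * SATOSHI_PER_BTC  # block reward at launch
BLOCKS_PER_HALVING = 210_000                   # about 4 years of blocks


def total_supply_satoshi():
    """Sum the block rewards over all halving periods until the reward hits zero."""
    total = 0
    reward = INITIAL_REWARD_SATOSHI
    while reward > 0:
        total += reward * BLOCKS_PER_HALVING
        reward //= 2  # integer halving: rewards are whole satoshis
    return total


supply_btc = total_supply_satoshi() / SATOSHI_PER_BTC
```

The sum is a truncated geometric series, so the supply converges to just under 21 million BTC instead of growing without bound.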

This paper aims at investigating whether the peculiar
"deflationary" mechanism governing the system's monetary
issuance is mainly responsible for the formation, and the
subsequent crash, of speculative bubbles (against the USD
and other currencies). Indeed economic agents, before
undertaking any action within the system, already include the
system's monetary issuance in their preferences/expectations,
i.e. they already know that monetary issuance will never
diverge from their expectations, inasmuch as the system has an
unalterable monetary policy programmed to keep decreasing
the monetary issuance over time. Therefore, as the system grows
(think of it as the Gross Domestic Product of a national
economy) the demand for Bitcoins will increase, pushing its
price upwards against other traditional currencies, given the
ex-ante fixed monetary supply.

The reason why this issuance mechanism is hereby defined
“deflationary” is that, as long as the general belief that
the system as a whole will keep growing stands, the price
against other currencies will inevitably increase, increasing
the inter-temporal opportunity cost of spending any given
amount of BTC. As a matter of fact, since the agents know
that the price will increase, they are encouraged to withhold
any transaction in BTC and to increase their savings in
BTC. A very interesting effect of such a mechanism is that
any steep fall in the price may boost the awareness of BTC
as a system, potentially increasing its diffusion among the
general public, thus reinforcing the aforementioned
self-sustaining dynamic [17].

It must be noticed that if the system had a flexible monetary
policy, where changes are not known ex-ante, then the
economic agents inside and outside the system would not
be able to include it in their preferences, thus neutralizing
the aforementioned self-sustaining mechanism, even if the
Bitcoin system were flourishing. The explanation given
above should make clear why it is expected, and tested below
in this study, that the speculative bubble in BTC/USD rates
is a forward-looking phenomenon.


3 Mixed causal-noncausal autoregressive 
models 


For a long time, as mentioned by Gouriéroux & Hencic [15],
speculative bubbles were considered as nonstationary
phenomena and treated similarly to the explosive, stochastic
trends due to unit roots. Gouriéroux & Zakoian [11] propose
a different approach and assume that the bubbles are
rather short-lived explosive patterns caused by extreme
valued shocks in a noncausal, stationary process. In particular
they assume a noncausal AR(1) (AutoRegressive) model,
strictly forward looking, with Cauchy distributed errors.

A useful feature of such models is that shocks are
non-fundamental. Combining this trait with the extended time
dependence (to the future) allows these models to closely
fit the peculiar pattern of the aforementioned (definition of)
speculative bubble.


3.1 Introduction to noncausality 


Let $y_t$ be the observed time series on which the traditional
autoregressive model is estimated:

\[ a(L)\, y_t = \varepsilon_t \tag{1} \]

\[ (1 - a_1 L - \cdots - a_p L^p)\, y_t = \varepsilon_t \]

with $L$ being the backshift operator, i.e., $L y_t = y_{t-1}$ gives
lags and $L^{-1} y_t = y_{t+1}$ produces leads, and $a_1, \dots, a_p$
the autoregressive parameters. It is known [18] that if $s$ out of
the $p$ zeros of the polynomial $a(L)$ are inside the unit circle,
then the model is nonstationary, which makes it impossible to
estimate the traditional autoregressive model. $\varepsilon_t$
represents the usual error term of the model.


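In Lanne & Saikkonen [18] it is shown that when $p = r + s$,
with $r$ being the number of zeros outside the unit circle, one
can factor the polynomial $a(z)$ as¹:

\[ a(z) = \varphi^*(z)\, \phi(z) \tag{2} \]

where $\phi(z)$ is the usual causal polynomial of the
autoregressive parameters and $\varphi^*(z)$ has its zeros inside
the unit circle.

The polynomial $\varphi^*(z)$ can be expressed as:

\[
\begin{aligned}
\varphi^*(z) &= 1 - \varphi^*_1 z - \cdots - \varphi^*_s z^s \\
 &= -\varphi^*_s z^s \left( 1 + \frac{\varphi^*_{s-1}}{\varphi^*_s} z^{-1} + \cdots + \frac{\varphi^*_1}{\varphi^*_s} z^{-(s-1)} - \frac{1}{\varphi^*_s} z^{-s} \right) \\
 &= -\varphi^*_s z^s\, \varphi(z^{-1})
\end{aligned}
\tag{3}
\]

where $\varphi(z^{-1}) = 1 - \varphi_1 z^{-1} - \cdots - \varphi_s z^{-s}$,
in view of the fact that $\varphi^*_{s-j} / \varphi^*_s = -\varphi_j$
for $j = 1, \dots, s-1$ and $1/\varphi^*_s = \varphi_s$.

Because the zeros of $\varphi^*(z)$ lie inside the unit circle,
those of $\varphi(z)$ lie outside of the unit circle. Thus, (1) can
be written as:

\[ \phi(L) \left[ -\varphi^*_s L^s\, \varphi(L^{-1}) \right] y_t = \varepsilon_t \]

given the decomposition shown in (3). The latter expression can
also be rearranged as:

\[ \phi(L)\, \varphi(L^{-1})\, y_t = \epsilon_t \tag{4} \]

where $\epsilon_t = -(1/\varphi^*_s) L^{-s} \varepsilon_t = -(1/\varphi^*_s)\, \varepsilon_{t+s}$.
It is important to notice that $E_t[\epsilon_t] \neq \epsilon_t$,
since this variable is not determined by any information
available at time point $t$ (see above).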

3.2 Mixed Causal-Noncausal Autoregressive Model 


The univariate mixed causal-noncausal autoregressive model,
denoted MAR(r, s), shown in equation (4), is usually written
as:

\[ (1 - \phi_1 L - \cdots - \phi_r L^r)(1 - \varphi_1 L^{-1} - \cdots - \varphi_s L^{-s})\, y_t = \epsilon_t \tag{5} \]

When $\varphi_1 = \cdots = \varphi_s = 0$, the process $y_t$ is a
purely causal autoregressive process, denoted AR(r, 0):

\[ (1 - \phi_1 L - \cdots - \phi_r L^r)\, y_t = \epsilon_t \tag{6} \]

where $y_t$ is regressed on past values, giving the process $y_t$ a
backward looking autoregressive dynamic.

The process $y_t$ is purely noncausal when
$\phi_1 = \cdots = \phi_r = 0$, hence defined as:

\[ (1 - \varphi_1 L^{-1} - \cdots - \varphi_s L^{-s})\, y_t = \epsilon_t \tag{7} \]

usually referred to as a forward looking AR(0, s) process, being
the exact counterpart of the model specification given in (6),
since it is regressed on future values rather than past ones.

Models containing both lags and leads of the dependent
variable are called mixed causal-noncausal models.
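To make the forward looking case concrete, the sketch below simulates a purely noncausal process with s = 1, i.e. y_t = phi * y_{t+1} + eps_t, using standard Cauchy errors as in the noncausal AR(1) model mentioned earlier. The coefficient 0.8, the series length, the seed and the zero terminal value are assumptions chosen only for illustration.

```python
import math
import random


def simulate_noncausal_ar1(phi, n, seed=0):
    """Simulate y_t = phi * y_{t+1} + eps_t by recursing backward in time.

    Assumption for this sketch: the value after the last observation is zero.
    """
    rng = random.Random(seed)
    # Standard Cauchy draws via the inverse CDF: tan(pi * (u - 0.5)).
    eps = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
    y = [0.0] * (n + 1)  # y[n] holds the assumed terminal value
    for t in range(n - 1, -1, -1):
        y[t] = phi * y[t + 1] + eps[t]
    return y[:n], eps


y, eps = simulate_noncausal_ar1(phi=0.8, n=200)
```

Because each value depends on the future, a large shock at time t already shows up, damped by powers of phi, in the observations before t. Read in calendar time, the path therefore builds up gradually towards the shock and drops abruptly after it, which is the bubble-like pattern these models are meant to capture.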


¹In order to maintain the same notation as in Lanne & Saikkonen [18], the
polynomial $a(L)$ will be referred to as $a(z)$ for the following proof.

Assuming that the roots of the causal and noncausal
polynomials are outside the unit circle, that is:

\[ \phi(z) = 0 \ \text{only for} \ |z| > 1 \quad \text{and} \quad \varphi(z) = 0 \ \text{only for} \ |z| > 1 \tag{8} \]

then these conditions imply that the series $y_t$ admits a
two-sided Moving Average (MA) representation:

\[ y_t = \sum_{j=-\infty}^{\infty} \psi_j\, \epsilon_{t-j} \tag{9} \]

such that $\psi_j = 0$ for all $j < 0$ implies a purely causal
process $y_t$, while a purely noncausal model is obtained when
$\psi_j = 0$ for all $j > 0$ [19]. In more detail, the $\psi_j$ are
the coefficients of an infinite order polynomial in positive and
negative powers of the lag operator, such that
$\Psi(z) = \sum_{j=-\infty}^{\infty} \psi_j z^j = [\phi(z)]^{-1} [\varphi(z^{-1})]^{-1}$.

Error terms $\epsilon_t$ are assumed iid non-Gaussian with
$E(|\epsilon_t|^{\delta}) < \infty$ for all $\delta \in (0, 1)$ [11].
Following Gouriéroux & Jasiak [10], the unobserved causal and
noncausal components of the process $y_t$ are defined as follows:

\[ u_t \equiv \phi(L)\, y_t \iff \varphi(L^{-1})\, u_t = \epsilon_t \tag{10} \]

\[ v_t \equiv \varphi(L^{-1})\, y_t \iff \phi(L)\, v_t = \epsilon_t \tag{11} \]

The specification of these values will prove useful for the
following part regarding the estimation of mixed
causal-noncausal processes.

The non-Gaussianity assumption for the error term ensures
the identifiability of the causal and the noncausal part.
Most papers by Lanne & Saikkonen et al. use Student's $t_\nu$
distributions with $\nu > 2$, while Gouriéroux et al. rely on the
Cauchy or a mixture of Cauchy and Normal distributions.
As shown by Hecq et al. [14], the Cauchy has too heavy tails,
and many series would have a number of degrees of freedom
between 1.5 and 2.5².

Figure 2. TradeBitcoin data price download panel


²Notice that when $\nu \le 2$ the variance of the Student's t distribution is not finite.


4 The Data


The sample consists of 151 observations of the BTC/USD price
spanning from February 20 to July 20, 2013. The dynamic of
the data is shown in Fig. 1, where it is possible to notice the
speculative bubble behaviour of the BTC/USD path, booming
and bursting rapidly around the month of April. In fact,
in April 2013 there was a famous bubble, commonly
called simply the April bubble, consisting of a rally, an
all-time high and a subsequent crash of the bitcoin exchange
rate. The bubble resulted in a momentary all-time high of $266
per bitcoin on Mt. Gox³ on 10th April 2013. Mt. Gox then
suspended trading on 11th April 2013 until 2 am UTC on
12th April 2013 for a "market cooldown". The value of a single
bitcoin fell to a low of $55.59 after the resumption of trading
before stabilizing above $100⁴ (a price decline of 61%).

The data is obtained from our application, TradeBitcoin [1],
part of the BlockChainVis⁵ suite [2-4] used for Bitcoin
analysis and visualization. TradeBitcoin looks for price options
on the Bitcoin exchanges and writes possible arbitrage
operations to a database, to see whether it is possible to
correctly perform an arbitrage on the Bitcoin market. It also
collects price data from 17 different exchanges and allows that
data to be downloaded at a sampling interval of 1 day, 1 hour or
15 minutes (Fig. 2).


4.1 Price decomposition 


As a first step, it is important to disentangle the fundamental
component from the bubble component of the BTC/USD
prices. The fundamental value of Bitcoin is still under
debate. While [8] argues that this fundamental value
is zero, [9] links it to the reputation of the Bitcoin system
as measured by internet queries. In addition, [9] suggests
that the production cost of Bitcoin, due to the
mining process, should be considered the lower limit of
its fundamental value. Since this study assumes
that Bitcoin does have a fundamental value, the price is
first decomposed following the approach in [15], where
the fundamental path of the BTC/USD rate is assumed to
be a nonlinear deterministic trend, modelled as a 3rd-degree
polynomial in time, and the bubble part is obtained by
subtracting this trend from the observed prices.
The second decomposition builds upon
the suggestion made in [9], separating the production
cost of Bitcoin from the bubble component by means of the
cost of production model presented in [13].


³ https://en.bitcoin.it/wiki/Mt._Gox.
⁴ https://bitcoincharts.com/charts/mtgoxUSD#rg5zczsg2013-04-10zeg2013-04-12ztgSzm1g10zm2g25zl.
⁵ http://normandy.dmi.unipg.it/blockchainvis/.


4.1.1 Nonlinear deterministic trend
As mentioned above, the BTC/USD rate is defined as follows:


$$ rate_t = trend_t + y_t, \tag{12} $$


with $rate_t$ being the observed prices, $trend_t$ the fundamental
component and $y_t$ the bubble component. The estimated
trend is given by:


$$ trend_t = 0.000073\,t^3 - 0.0316\,t^2 + 3.6590\,t - 3.2951. $$


The corresponding time series are plotted in Fig. 3.
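The decomposition in Eq. 12 can be illustrated with a minimal sketch that evaluates the estimated cubic trend and subtracts it from the observed prices. The function names are hypothetical, and the convention that t is the observation index starting at 1 is an assumption, since the text does not state how time is indexed.

```python
# Coefficients of the estimated cubic trend as reported in the text
# (cubic, quadratic, linear, constant).
TREND_COEFFS = (0.000073, -0.0316, 3.6590, -3.2951)


def trend(t):
    """Evaluate the estimated cubic trend at time index t (Horner's scheme)."""
    a3, a2, a1, a0 = TREND_COEFFS
    return ((a3 * t + a2) * t + a1) * t + a0


def bubble_component(prices, first_index=1):
    """Return y_t = rate_t - trend_t for each observed price.

    Assumption: observations are indexed consecutively from `first_index`.
    """
    return [p - trend(t) for t, p in enumerate(prices, start=first_index)]
```

Given the list of daily closing prices, `bubble_component(prices)` would return the residual series that the study treats as the bubble part.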


4.1.2 Production cost as the lower limit of the fundamental value


The Bitcoin production cost model presented in [13] takes
the perspective of a generic miner who is deciding whether
or not to mine Bitcoin. The miner will join the
mining process if profit expectations are positive and
abandon it otherwise. The variables considered in [13]
to influence the mining process, and hence the production
cost, are: the block reward $\beta$, the hashing power
(computational power) of the mining hardware
$\rho$, the difficulty set by the network $\delta$, the cost per
kilowatt-hour⁶ $\$_{kWh}$ and the average energy efficiency
$W_{GH/s}$ (watts per GH/s) of the mining hardware deployed.

As shown in [13], the expected number of cryptocurrency
coins mined per day on average, given the difficulty and the
block reward (number of coins issued per successful mining
attempt), per unit of hashing power is given by:


$$ BTC/day = \frac{\beta\,\rho\,sec_{hr}}{\delta\,2^{32}}\,hr_{day}, $$


with $sec_{hr} = 3600$ being the seconds in one hour and $hr_{day} = 24$ being
the hours in a day.
The cost of mining per day can be expressed as:


$$ E_{day} = (\rho/1000)\,(\$_{kWh} \cdot W_{GH/s} \cdot hr_{day}), $$


with $\$_{kWh}$ being the electricity cost and $W_{GH/s}$ the
average energy efficiency. Bitcoin production cost estimates
over the considered time span (Feb.-Jul. 2013) are shown in
Fig. 4, assuming an average energy efficiency of
$W_{GH/s} = 500$, as suggested by Garcia et al. [9], a
computational power⁷ of $\rho = 1000$ GH/s and an average global
electricity cost of $\$_{kWh} = 0.115$. In 2013 the block reward set by the
network was $\beta = 25$ BTC; the values of the ever-changing
difficulty over the considered time period can be found in
the public database https://blockchain.info.
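The cost model described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the conversion of hashing power from GH/s to H/s in the coins-per-day formula is an assumption about the units used in [13], the cost per coin is taken as the daily energy cost divided by the expected coins per day, and the difficulty value is a made-up placeholder, not a figure from the data.

```python
SEC_PER_HOUR = 3600
HOURS_PER_DAY = 24


def btc_per_day(block_reward, hashrate_ghs, difficulty):
    """Expected coins mined per day for a given hashing power.

    On average difficulty * 2**32 hashes are needed to find a block.
    Assumption: the hashing power is supplied in GH/s and converted to H/s.
    """
    hashes_per_day = hashrate_ghs * 1e9 * SEC_PER_HOUR * HOURS_PER_DAY
    return block_reward * hashes_per_day / (difficulty * 2**32)


def energy_cost_per_day(hashrate_ghs, usd_per_kwh, watts_per_ghs):
    """Daily electricity cost in USD (power in kW times price times 24 h)."""
    return (hashrate_ghs / 1000) * (usd_per_kwh * watts_per_ghs * HOURS_PER_DAY)


def cost_per_btc(hashrate_ghs, usd_per_kwh, watts_per_ghs, block_reward, difficulty):
    """Production cost of one coin: daily energy cost over daily coin output."""
    daily_cost = energy_cost_per_day(hashrate_ghs, usd_per_kwh, watts_per_ghs)
    return daily_cost / btc_per_day(block_reward, hashrate_ghs, difficulty)


# Parameter values quoted in the text; the difficulty is a placeholder only.
EXAMPLE_DIFFICULTY = 1.0e7
example_cost = cost_per_btc(1000, 0.115, 500, 25, EXAMPLE_DIFFICULTY)
```

Because the hashing power appears in both the numerator and the denominator, it cancels out of `cost_per_btc`, so in this sketch the cost per coin depends only on the difficulty, the electricity price, the energy efficiency and the block reward.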

⁶ In this study we consider the same average electricity cost as
in [13], although it should be noted that this cost changes depending
on the geographical location of the miner and therefore on the national
electricity supplier.

⁷ It should be noted that in this example varying the computational power
does not change the cost, i.e., the cost of 1 BTC in USD is only affected by
the difficulty, the electricity cost and the average energy efficiency of the
mining hardware; increasing the scale of production in this case does not
lead to economies of scale.


Taking the lower limit of the fundamental value, given
by the aforementioned definition of production cost, as the
actual fundamental value, the BTC/USD rate is
defined similarly to Eq. 12:


$$ rate_t = cost_t + yc_t, \tag{13} $$


with $rate_t$ being the observed prices, $yc_t$ the bubble
component and $cost_t$ the fundamental component, given in this
case by the aforementioned production cost. Assuming
that the production cost is correctly estimated,
the bubble component can be interpreted
as the added market value.

The corresponding time series are plotted in Fig. 4.

Figure 3. Bitcoin/USD price decomposition


Figure 4. Bitcoin/USD price vs. production cost (legend: close price, residual bubble, production cost)


5 Estimated models 


The following part of the study undertakes a mixed causal/noncausal
analysis by estimating MAR models on the BTC/USD
price and on the bubble component according to the two
different definitions of the fundamental part. As already
discussed in the introduction, the forward-looking
dependence is expected to be stronger in the isolated bubble
component than in the observed price. The model specifications
in what follows are chosen by applying information
criteria, which are useful tools to select the number of lags (and
leads) to be included in the model. The information criteria
considered here are the Akaike information criterion (AIC),
the Bayesian information criterion (BIC) and the
Hannan-Quinn information criterion (HQ) (for a general review see
the book by Hamilton [12]). Once the number of lags and leads
has been selected, models are estimated by maximizing the
approximate log-likelihood function based on the Student's
t density for the error term; a detailed description of
the procedure may be found in Hecq et al. [14]. The related
Matlab routines used in this work were kindly provided by the
authors of that paper.
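To make the objective function concrete, the following minimal sketch evaluates a Student's t log-likelihood for a given sequence of residuals. It is not the Matlab routine of [14]: the function names are hypothetical, the density is the standard location-scale Student's t with location 0, and the approximation details of the actual estimator (how residuals are built from leads and lags) are left out.

```python
import math


def student_t_logpdf(e, scale, dof):
    """Log-density at e of a Student's t with location 0, the given scale
    and `dof` degrees of freedom."""
    z = e / scale
    return (
        math.lgamma((dof + 1) / 2)
        - math.lgamma(dof / 2)
        - 0.5 * math.log(dof * math.pi)
        - math.log(scale)
        - (dof + 1) / 2 * math.log1p(z * z / dof)
    )


def log_likelihood(residuals, scale, dof):
    """Sum of log-densities over the residuals, assuming they are iid."""
    return sum(student_t_logpdf(e, scale, dof) for e in residuals)
```

An estimator in this spirit would search over the autoregressive coefficients, the scale and the degrees of freedom for the values that maximize `log_likelihood`.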


Table 1. AR(1) model's estimated parameters


AR(1) model               | t distribution
$\phi_1$     Std. Dev.    | $\lambda$    $\nu$
0.8066       0.0234       | 4.3928       2.5013


Table 2. BDS test results, purely noncausal AR(1) model.


m   w             p-value   | m    w             p-value
2   5.978547545   1.13E-09  | 9    13.3666165    0
3   6.525463574   3.39E-11  | 10   14.65204326   0
4   7.420797806   5.82E-14  | 11   15.91260972   0
5   8.615265114   0         | 12   17.42915916   0
6   9.743131337   0         | 13   19.3469674    0
7   10.83386529   0         | 14   21.49415105   0
8   12.07560832   0         | 15   24.05813227   0


5.1 Noncausal analysis of the bubble component


First, a strictly noncausal (forward-looking) AR(1) is considered:

$$ y_t = \phi_1 y_{t+1} + \varepsilon_t, $$

where the $\varepsilon_t$ are iid Student's t distributed errors with
location 0 and scale parameter $\lambda$, $\varepsilon_t \sim t_\nu(0, \lambda)$. Estimated
parameters are reported in Table 1. The residuals of the model are
shown in Fig. 5. In order to assess the model's goodness of fit,
Table 2 reports the results of the BDS test (Brock, Dechert
and Scheinkman, 1987) [16], used to test whether the residuals
are truly a sequence of iid Student's t random variables.
The test rejects the null hypothesis
of iid residuals, which implies that the present
model must be discarded.

Figure 5. Noncausal AR(1) model residuals


Table 3. Information Criteria


p   BIC      AIC      HQ      | p   BIC      AIC      HQ
0   6.0119   5.9565   5.9649  | 5   4.8862   4.5537   4.6042
1   4.7057   4.5949   4.6117  | 6   4.914    4.526    4.585
2   4.6823   4.5161   4.5413  | 7   4.9494   4.506    4.5734
3   4.7513   4.5296   4.5633  | 8   4.987    4.4882   4.5639
4   4.818    4.5409   4.583   |


Table 4. MAR(1,1) estimated parameters.


Parameter     | Estimate   Confidence bounds
$\phi_1$      | 0.5255     0.4585   0.5925
$\varphi_1$   | 0.6503     0.5897   0.7110

Figure 6. Mixed causal-noncausal MAR(1,1) residuals

Table 5. BDS test results for the MAR(1,1) model residuals
on $y_t$.


m   w             p-value   | m    w             p-value
2   1.703389269   0.0442    | 9    6.924875969   2.18E-12
3   2.249277819   0.0122    | 10   7.294838916   1.50E-13
4   3.342684301   4.15E-04  | 11   7.639695322   1.09E-14
5   4.384330303   5.82E-06  | 12   8.143710723   2.22E-16
6   5.191782337   1.04E-07  | 13   8.892820344   0
7   5.887119414   1.96E-09  | 14   9.500629834   0
8   6.412692984   7.15E-11  | 15   10.18321732   0


5.1.1 Mixed causal-noncausal AR model 


The following model specification is derived from the
suggestions of the information criteria, which are useful
tools to determine the time dependencies to be included in
the model, i.e. the order $p$ of the
autoregressive polynomial (see Eq. 1). The information
criteria considered here are the Akaike information criterion
(AIC), the Bayesian information criterion (BIC) and the Hannan-Quinn
information criterion (HQ) [14]; Hecq et al. [14] show that
simulation results favour the use of BIC. As reported
in Table 3, the information criteria (BIC and HQ) suggest setting $p = 2$.
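The order selection can be reproduced from the tabulated values with a minimal sketch. The lists below are transcribed from Table 3 (reading the commas as decimal separators); the helper name `best_order` is hypothetical.

```python
# Information criteria from Table 3; the list index is the order p (0..8).
BIC = [6.0119, 4.7057, 4.6823, 4.7513, 4.818, 4.8862, 4.914, 4.9494, 4.987]
AIC = [5.9565, 4.5949, 4.5161, 4.5296, 4.5409, 4.5537, 4.526, 4.506, 4.4882]
HQ = [5.9649, 4.6117, 4.5413, 4.5633, 4.583, 4.6042, 4.585, 4.5734, 4.5639]


def best_order(values):
    """Return the order p at which an information criterion is smallest."""
    return min(range(len(values)), key=values.__getitem__)


selected = {
    name: best_order(values)
    for name, values in (("BIC", BIC), ("AIC", AIC), ("HQ", HQ))
}
```

With these values BIC and HQ are both minimized at order 2, while AIC keeps decreasing up to the largest order considered, which is in line with the preference for BIC reported in [14].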

When $p = 2$ the estimated mixed causal-noncausal model
is a MAR(1,1):


$$ (1 - \phi_1 L)(1 - \varphi_1 L^{-1})\, y_t = \varepsilon_t. $$


Table 4 shows the estimated parameters of the model. Fig. 6
shows the sequence of the MAR(1,1) model's estimated
residuals $\hat{\varepsilon}_t$.

As shown in Table 5, the BDS test for independence rejects
the null hypothesis of iid residuals for
most of the tested embedding dimensions, which suggests
discarding the model just estimated.
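For a MAR(1,1), the residuals follow from expanding the product of the lag and lead polynomials. The sketch below is illustrative only: the function name is hypothetical, and it simply applies the expanded filter to a series, using one coefficient for the lag (causal) term and one for the lead (noncausal) term.

```python
def mar11_residuals(y, causal, noncausal):
    """Residuals of (1 - causal*L)(1 - noncausal*L^-1) y_t = eps_t.

    Expanding the product gives
        eps_t = (1 + causal*noncausal) * y_t
                - causal * y_{t-1} - noncausal * y_{t+1},
    which needs one lag and one lead, so the first and the last
    observation produce no residual.
    """
    return [
        (1 + causal * noncausal) * y[t] - causal * y[t - 1] - noncausal * y[t + 1]
        for t in range(1, len(y) - 1)
    ]
```

Setting `noncausal` to zero reduces the filter to a purely causal AR(1), and setting `causal` to zero gives the purely noncausal AR(1).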


Table 6. Information Criteria, MAR model on $rate_t$


p   BIC      AIC      HQ      | p   BIC      AIC      HQ
0   7.0358   6.9803   6.9888  | 5   4.9587   4.6262   4.6767
1   4.7483   4.6374   4.6543  | 6   5.0137   4.6257   4.6847
2   4.7546   4.5883   4.6136  | 7   5.0781   4.6348   4.7021
3   4.8215   4.5998   4.6335  | 8   5.0705   4.5716   4.6474
4   4.8906   4.6135   4.6556  |


5.2 Noncausal analysis of the observed price BTC/USD

Since the interpretation of the aforementioned estimated
parameters can be rather misleading, and therefore hard to
extend to the market reality, given the arbitrary choice of
the fundamental component, it is of great interest to estimate
the MAR model directly on the BTC/USD time series ($rate_t$).
The aforementioned information criteria, applied
to the BTC/USD time series, suggest setting the order of the
autoregressive polynomial to $p = 1$ (see Eq. 1) or $p = 2$,
depending on the selected criterion (see Table 6).


Table 7. MAR(0,1) estimated autoregressive parameters on
$rate_t$


Parameter     | Estimate   Confidence bounds
$\phi_1$      | 0.9809     0.9740   0.9878


Table 8. Estimated parameters of the Student's t error
distribution, MAR(0,1) model on $rate_t$


$\lambda$    $\nu$
3.3073       1.4863


Table 9. BDS test results for the MAR(0,1) model residuals


Table 10. MAR(1,1) estimated autoregressive parameters on
$rate_t$


Parameter     | Estimate   Confidence bounds
$\phi_1$      | 0.9747     0.9650   0.9844
$\varphi_1$   | 0.2781     0.2077   0.3485


Table 11. Estimated parameters of the Student's t error
distribution, MAR(0,1) model on $rate_t$


$\lambda$    $\nu$
3.3901       1.6043


Table 12. BDS test results for the MAR(1,0) model residuals
on $rate_t$.


m   w             p-value   | m    w             p-value
2   6.450686379   5.57E-11  | 9    16.22165279   0
3   8.266350334   1.11E-16  | 10   18.05194201   0
4   9.300742867   0         | 11   20.14558908   0
5   10.38450476   0         | 12   22.50443038   0
6   11.66035838   0         | 13   25.5997597    0
7   13.03694212   0         | 14   29.31704918   0
8   14.59403959   0         | 15   33.57504129   0


Table 13. BDS test results for the MAR(1,1) model residuals
on $rate_t$.


m   w          H0   | m    w          H0
2   6.415069   1    | 9    16.2165    1
3   8.26635    1    | 10   18.05194   1
4   9.30074    1    | 11   20.14559   1
5   10.8450    1    | 12   22.50443   1
6   11.66036   1    | 13   25.59976   1
7   13.03694   1    | 14   29.31705   1
8   14.59404   1    | 15   33.57504   1


5.2.1 Estimated MAR model, case p = 1 


When $p = 1$ the estimated model that best fits the observed
time series $rate_t$ is a purely causal MAR(1,0). Table 7 and
Table 8 display the estimated parameters of the autoregressive
polynomial and of the error distribution, respectively.
Since the estimated degrees of freedom are $\nu = 1.4863 < 2$,
the estimated sequence of error terms $\varepsilon_t$ cannot be
likened to the case $\varepsilon_t \sim$ iid $t_\nu(0, \lambda)$, given that when
$\nu < 2$ the variance of the distribution is not finite.
In any case, the BDS test (Table 12) for independence rejects
the null hypothesis of iid residuals, thus
suggesting once again that the estimated model should be discarded.


m   w             p-value   | m    w             p-value
2   5.121150647   1.52E-07  | 9    15.28926444   0
3   7.537452115   2.40E-14  | 10   16.97636905   0
4   8.912767967   0         | 11   18.98947641   0
5   10.07953357   0         | 12   21.39229563   0
6   11.15351875   0         | 13   24.08437913   0
7   12.36324983   0         | 14   27.23839452   0
8   13.78408522   0         | 15   30.78448895   0


5.2.2 MAR model, p = 2


When $p = 2$ the estimated model is a mixed causal-noncausal
MAR(1,1); estimates of the model are displayed in Tables 10 and 11.
Once again the estimated degrees of freedom of the t distribution,
$\nu = 1.6043$, are below 2, so the estimated sequence of
error terms cannot be likened to the case $\varepsilon_t \sim$ iid $t_\nu(0, \lambda)$.
In any case, the BDS test (Table 12) for independence rejects
the null hypothesis of iid residuals, thus suggesting once
again that the estimated model should be discarded.


5.3 Residual analysis 


To sum up, MAR models are estimated directly on the BTC/USD
time series ($rate_t$) and then on the bubble part ($y_t$, $yc_t$); the
aforementioned information criteria suggest setting the order
of the autoregressive polynomial to $p = 1$ or $p = 2$, depending
on the selected criterion, both for the BTC/USD rate and for
the bubble terms. In the former case, $p = 1$, a strictly causal
backward-looking AR(1) is the preliminary reference
specification for both the full rate $rate_t$ and the bubble component
$yc_t$, whereas a strictly noncausal forward-looking AR(1) is
the preliminary reference for the bubble component $y_t$. In
the latter case, $p = 2$, a MAR(1,1) model is fitted to
all the time series $rate_t$, $y_t$ and $yc_t$.
The estimation results are reported in Table 14.


Table 14. MAR(r,s) estimated parameters on $rate_t$, $y_t$ and $yc_t$


T. Series | MAR(r,s) | Par.       Est.     Conf. bounds     | Par.          Est.     Conf. bounds
$rate_t$  | MAR(1,0) | $\phi_1$   0.9809   0.9740  0.9878   | $\varphi_1$   -        -       -
$rate_t$  | MAR(1,1) | $\phi_1$   0.9747   0.9650  0.9844   | $\varphi_1$   0.2781   0.2077  0.3485
$y_t$     | MAR(0,1) | $\phi_1$   -        -       -        | $\varphi_1$   0.8066   0.8028  0.8103
$y_t$     | MAR(1,1) | $\phi_1$   0.5255   0.4585  0.5925   | $\varphi_1$   0.6503   0.5897  0.7110
$yc_t$    | MAR(1,0) | $\phi_1$   0.9803   0.9702  0.9904   | $\varphi_1$   -        -       -
$yc_t$    | MAR(1,1) | $\phi_1$   0.3424   0.2604  0.4245   | $\varphi_1$   0.9396   0.9216  0.9576


It is evident from the results in Table 14 that there is a very
strong backward-looking dependence on one lagged value
for the BTC/USD rate and for the bubble component $yc_t$;
conversely, for the isolated bubble term $y_t$ there is a very
strong forward-looking dependence on one lead value.

The estimation of a mixed causal-noncausal MAR(1,1)
gives further insights into the backward and forward
dependence; the outcomes are summed up in Table 14 for
the full rate $rate_t$, the bubble term $y_t$ and the bubble term
$yc_t$, respectively.

Particularly interesting is the difference in the parameters
$\phi_1$ and $\varphi_1$ when estimating the MAR(1,1) model separately on
the bubble component $y_t$ and on the original time series $rate_t$.
As shown in Table 14, the noncausal (forward-looking)
parameter $\varphi_1$ is larger in the bubble component $y_t$ than in
the observed price $rate_t$, whereas the causal parameter $\phi_1$ is
much larger in the observed price $rate_t$ than in the bubble
component $y_t$. This is consistent with the conjecture made
in the introduction that the speculative bubble is a
forward-looking rather than a backward-looking phenomenon, since the
forward-looking parameters estimated on the bubble part
are stronger than those estimated on the observed BTC/USD price
$rate_t$. This evidence is strengthened by the MAR(1,1) parameters
estimated on the bubble part $yc_t$: the values of the
forward-looking and backward-looking
components almost trade places when the model is estimated
on the full price time series $rate_t$ and on the bubble component
$yc_t$, respectively.

As mentioned in Section 3.2, if the model
is correctly specified then the model residuals $\varepsilon_t$ should be
a sequence of independent and identically distributed (IID) Student's
t observations. In this study the IID hypothesis is tested
through the BDS test for independence. This test is based on
the correlation dimension, with embedding dimension $m$. Since
it can be shown [16] that the test statistic $w$ is asymptotically
standard normal, $w \sim N(0, 1)$, p-values are straightforward to obtain.
Tables 15 and 16 report the outcome of
this test on the residuals $\varepsilon_t$ of the models estimated on $yc_t$.
As Table 5 shows, the only
model for which the null hypothesis of IID residuals cannot
be rejected is the MAR(1,1) model on $y_t$, and only for the lowest
embedding dimensions ($m = 2$ and $m = 3$ in Table 5), depending on the
chosen significance level.


Table 15. BDS test results for the MAR(1,0) model residuals
on $yc_t$.


m   w             p-value   | m    w             p-value
2   5.528256527   1.62E-08  | 9    12.29309902   0
3   7.009012288   1.20E-12  | 10   13.36389728   0
4   7.773551059   3.77E-15  | 11   15.15913039   0
5   8.506760111   0         | 12   17.1200305    0
6   9.287013257   0         | 13   19.26464437   0
7   10.17111148   0         | 14   21.78528929   0
8   11.24555355   0         | 15   24.56128819   0


Table 16. BDS test results for the MAR(1,1) model residuals
on $yc_t$.


m   w             p-value   | m    w             p-value
2   4.236007075   1.14E-05  | 9    10.32346963   0
3   5.315716933   5.31E-08  | 10   11.13612446   0
4   6.272423899   1.78E-10  | 11   12.51271091   0
5   7.320422719   1.24E-13  | 12   14.01078341   0
6   8.132958153   2.22E-16  | 13   15.68841368   0
7   8.838578863   0         | 14   17.59084596   0
8   9.612705164   0         | 15   19.49717776   0
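Because the BDS statistic is asymptotically standard normal, a p-value can be obtained directly from the normal tail. The sketch below assumes a one-sided (upper-tail) test, an assumption that appears to agree with the first rows of Table 5; the function name is hypothetical.

```python
import math


def upper_tail_p_value(w):
    """Return P(Z > w) for a standard normal Z, via the complementary
    error function."""
    return 0.5 * math.erfc(w / math.sqrt(2.0))


# Statistics taken from the first two rows of Table 5 (m = 2 and m = 3).
p_m2 = upper_tail_p_value(1.703389269)
p_m3 = upper_tail_p_value(2.249277819)
```

Comparing such p-values with a chosen significance level gives the reject or do-not-reject decision for each embedding dimension.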


6 Conclusions 


This study undertook a mixed causal-noncausal analysis of
the BTC/USD exchange rate time series over the period
February-July 2013, to test whether the bubble effect
disentangled from the observed data may be explained by a
forward-looking behaviour of the economic agents. In the
introduction it was noted that, given the system's monetary issuance,
the exchange rate of one Bitcoin with respect to a traditional
currency should be influenced by agents' future
expectations, and that classical ARIMA models, backward-looking by
definition, are not suitable to describe the dynamics of the
Bitcoin price because the only time dependence
admitted by these models regards the past. Mixed backward- and
forward-looking MAR models are hence considered both for
the BTC/USD exchange rate and for the isolated bubbles.
The conjecture underlying this study is that the forward-looking
parameters should be stronger in the bubble part
than in the observed price. This indeed turns out to be the
case when the models are estimated on the observed data;
however, the residual analysis, conducted by performing the
BDS test for independence, suggests that these
models should not be considered valid, except (partially) in one case.
Since the results of this test are asymptotic (for n → ∞) and given the
small number of residuals, a more extensive residual analysis
could be performed in order to assess the capability of the
chosen model to describe the dynamics of the BTC/USD rate
and/or of the isolated bubble terms ($y_t$, $yc_t$). Several techniques
are available, such as the classical Ljung-Box Q test on residual
autocorrelation (see [12]). Since the focus of this
study is not to identify the true data generating process
for the Bitcoin, a deeper investigation of this issue is beyond
the scope of the present study and will be tackled in future
research.
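As an indication of what such a complementary check involves, the following minimal sketch computes the standard Ljung-Box Q statistic from a residual series. It was not part of the analysis reported here, and the function name is hypothetical. Under the null hypothesis of no autocorrelation, Q is usually compared with a chi-squared distribution whose degrees of freedom are the number of lags, reduced by the number of fitted parameters.

```python
def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic for autocorrelation up to `max_lag`:
    Q = n (n + 2) * sum_{k=1..max_lag} rho_k^2 / (n - k)."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [e - mean for e in residuals]
    denom = sum(c * c for c in centered)
    total = 0.0
    for k in range(1, max_lag + 1):
        # Sample autocorrelation of the residuals at lag k.
        rho_k = sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total
```

Large values of Q relative to the chi-squared reference would point to remaining serial dependence in the residuals.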

In the future we plan to evaluate the use of cross-evaluation
techniques and to propose complementary
validation with regression metrics such as RMSE, MAE,
RMSD and others.

ABSTRACT


Many context-aware recommendation methods extract contexts 
from reviews using supervised methods. However, this requires the 
optimal values for contexts to be predefined, which is not a trivial 
task. Although some approaches have avoided this by utilizing un- 
supervised methods, the extracted contexts have been limited to a 
unigram format. Moreover, most methods consider only the influ- 
ence of context on the entire dataset, ignoring the fact that context 
might be relevant to individual users or items unequally. This work 
proposes a novel unsupervised context extraction method that uses 
predictive models for future ratings. Unlike previous work, we ex- 
tract context from reviews automatically in the form of skip-grams 
by applying a region embedding technique. The predictive models 
utilize the interaction between contexts and users (and items) to 
model their influence on ratings. Experiments demonstrate that our 
models can outperform existing review-based recommendations 
that ignore contexts.

1 INTRODUCTION


Recommender systems were devised to provide personalized rec- 
ommendations about specific items to individual users. The most 
common approach to making recommendations is to exploit the
users’ past preferences about items, such as their ratings, to
create a predictive model for their future rating of unseen items. In
addition to using rating data, context-aware recommenders offer 
more effective recommendations by taking into account contextual 
information (or simply “context”). Context such as location, time 
or weather can have a major influence on users’ decisions when 
they are choosing items. For example, if a user is seeking a hotel 
for a summer vacation, the recommendation engine should suggest 
hotels in a beach area, rather than in mountainside ski resorts. In- 
corporating such contexts has been shown to provide more accurate 
predictions than standard context-free recommender systems [2]. 

Although context can improve recommendations, obtaining it 
is not trivial. In traditional recommendation schemes, users re- 
view items they have previously chosen and assign rating scores to 
indicate their preference levels for those items. Context is rarely 
provided. To obtain such information, early approaches to context- 
aware recommenders collected context by explicitly asking users 
to supply it. Specifically, in addition to their ratings on items, users 
were asked to select a context from a predefined list of context op- 
tions. Collecting context data in this way is not only expensive but 
also not useful in real-world scenarios where most users have no 
intention of providing such information. Therefore, many context- 
aware methods try to infer the context from additional sources of 
data [9, 13, 23]. In recent years, the most popular source of context 
has been user-generated reviews [7, 15, 17]. 

In reviews, users can express opinions about their experience and 
level of satisfaction with the items concerned, which can, therefore, 
be a rich source of context data [7]. However, the context in reviews
has to be recognized as such before it can be used. Two common
approaches to this task are to use text-mining techniques or
to define contexts as latent variables. The text-mining approach
[6, 10, 12] utilizes techniques such as string matching to identify 
and extract words in a review that match a predefined list of context 
strings. Determining the optimal values of contexts for a specific 
domain then becomes the main challenge in this approach because 
it can significantly affect the quality of the recommendations. To 
address this issue, some approaches have applied unsupervised 
techniques to extract context from reviews [5, 19]. However, the 
data extracted by this work has been restricted to a unigram format 
(e.g., “night” or “friend”). More realistically, the context might well 
involve more than just one word, in the form of n-grams such as
“business trip” or “king-sized bed.”

After obtaining the context, it is then important to separate the 
relevant context from the irrelevant context. Because not all context 
will be related to the objective of the recommendation, including 
irrelevant information could degrade the recommendation quality
[2]. Several methods define relevant context items as those that have
a significant influence on ratings’ distributions [2, 14, 18]. However, 
this is applicable only to predefined types of contexts that have 
static values and are of fixed size. Moreover, most methods identify 
relevant context based on its influence on an entire dataset. In 
reality, the relevance of an element of context might well depend on 
individual users’ preferences and the specific target items’ features, 
which would, therefore, influence the ratings in particular reviews. 

In this work, we propose a novel method for defining, extracting 
and representing context from review data, together with an effi- 
cient method to utilize context for personalized rating predictions. 
We define context as any word in a region of text that has an influ- 
ence on the distribution of ratings. Such words can be in unigram, 
n-gram or even skip-gram format, provided that they reside within 
the same region of text. By applying region embedding with the 
local context unit proposed by [21], the positions of context words 
in a text region can be emphasized as those that contribute the 
highest variance in ratings. As a result, contexts can be represented 
by the region embeddings that capture their influence on the rating 
distributions for an entire review dataset. To generate a rating pre- 
diction, we model the influences of context items on a particular 
review’s rating based on their past interactions with an individual 
user and/or item. Based on these interactions, we propose four pre- 
dictive models for rating prediction. Our experiments demonstrate 
that our proposed method is more accurate than the review-based 
recommendation methods that utilize only word embeddings and 
do not consider the context in making rating predictions. 
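To illustrate the candidate formats mentioned above (unigrams aside), the following minimal sketch enumerates ordered word pairs that fall within the same region of text, which covers both adjacent pairs (bigrams) and pairs with gaps (skip-grams). It only shows what such candidates look like; it is not the region embedding technique of [21], and the function name and the fixed-size sliding window are assumptions made for the example.

```python
from itertools import combinations


def region_word_pairs(tokens, region_size, n=2):
    """Collect ordered n-word combinations from every window of
    `region_size` consecutive tokens.

    Adjacent words yield n-grams; words separated by other words in the
    same window yield skip-grams.
    """
    grams = set()
    for start in range(max(1, len(tokens) - region_size + 1)):
        region = tokens[start:start + region_size]
        grams.update(combinations(region, n))
    return grams


pairs = region_word_pairs("stayed here on a business trip".split(), 3)
```

In this example the pair ("business", "trip") is a bigram, ("on", "business") is a skip-gram that jumps over one word, and ("stayed", "trip") is not produced because the two words never share a region of three tokens.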


2 RELATED WORK 


Our proposed method relates mainly to two subcategories of rec- 
ommender systems, namely context-aware and review-based rec- 
ommenders. 


2.1 Context-Aware Recommenders 


In context-aware recommender systems, context is usually defined 
as “any information that can be used to characterize the situation 
of an entity” [1]. Based on this definition, [8] classified context 
into representational and interactional approaches. Most of the pro- 
posed context-aware methods adopt the representational approach, 
whereby contexts are defined by predefined sets of values, and their 
structures are static (e.g., the context Time might be defined with
the values {“weekday,” “weekend”}). Defining optimal values for the
predefined contexts, however, is not a trivial task, and depends on 
the specific recommendation domains. 

The interactional approach, on the other hand, assumes that user 
behaviors are influenced by context, but the contexts themselves 
are not necessarily observable. In this approach, therefore, unsuper- 
vised techniques are required to extract and represent contexts. In 
one such example, [9] applied topic modeling to represent context 
as sequences of latent topics that capture changes in users' interests. 
In another example, [23] infers context from mobile-sensor data and 
represents it in a latent low-dimensional space using unsupervised 
deep learning techniques. Finally, [13] treats social media news
feeds as the context for recommending the top-k relevant
advertisements to users. This interactional approach has the advantage of
not having to predefine context values, which enables the discovery
of hidden or unobserved contexts, and is therefore applicable to a
wider range of recommendation domains than the representational 
approach. 

To produce effective recommendations, only relevant context 
should be taken into account. For example, the context Companion
(e.g., “friend” or “family”) would probably be more relevant to users’
decisions on choosing movies to watch, than the context Season
(e.g., “summer” or “winter”). To identify the relevant contexts, [3]
conducted a user survey that let users imagine rating items in differ- 
ent contexts and evaluated the influence in each case. In addition to 
such manual methods, [2, 14, 18] applied statistical testing such as 
the paired t test or Pearson's chi-squared test to detect the relevant 
context items as those that contributed significant differences to 
the distribution of ratings for each context item. For example, the 
context Companion would be considered as relevant if most users 
who watched movies with friends gave higher overall ratings while 
those who watched with family gave lower ratings. Alternatively, 
[18] classified the influence of context into two levels, namely the 
population and the individual level. The relevant context at the 
population level influences the rating distributions for the entire 
dataset, whereas the relevant context at the individual level in- 
fluences the ratings for individual users or items. Most methods 
effectively identify relevant context at the population level, using 
techniques such as [2, 14]. However, some methods identify rel- 
evant context at the individual level. For example, [4] predicted 
ratings by modifying a matrix factorization to model the relation- 
ship between contexts and items. In our previous work [22], we 
proposed a latent probabilistic model that captured the interactions 
between relevant context and the user (and item) classes. 

Most methods for identifying the relevant contexts are only 
applicable to a representational approach (i.e., predefined, static and 
fixed-size contexts). In this work, we propose using an interactional 
approach to extract the relevant context from review data, based 
on the idea of two influence levels for context proposed in [18]. In 
this paper, we will refer to the influence at the population level as 
the global influence and to the influence at an individual level as 
the local influence. 


2.2 Review-based Recommenders 


In recent years, many recommendation techniques have tried to 
incorporate user-generated reviews as the main resource for mod- 
eling user preferences [7]. To produce recommendations using the 
contents of reviews, some methods have exploited word-embedding 
techniques [15, 17]. These methods assume that the low-dimensional 
representations of words such as those in Word2Vec [16] or GloVe 
[20] can be used to build and represent the user and target-item pro- 
files. In [17], these profiles were computed by summing the word 
embeddings of all words in all reviews associated with the user and 
item. Inspired by this idea, [15] combined the ratings as the weights 
for each review to create the profiles. Specifically, the preference 
profile for user uj: Sj € R^ and the characteristic profile for item 
vj Sj € Rh, where h is the embedding size, can be computed 
respectively by Equations 1 and 2. 


$; = ` 2 w2w(wt) x pre fii, j)» (1) 


Qi, j)EReviewn, Wt © Ui, j) 
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Q(i,j) EReviewo; wtEqi,j 


Here, the parameter $pref_{(i,j)}$ is the rating by user $u_i$ of item $v_j$,
and $Review_{u_i}$ and $Review_{v_j}$ are the sets of $u_i$'s and $v_j$'s reviews,
respectively. The function $w2v(w_t)$ maps the word $w_t$ to its embedding
representation, which can be learned beforehand by any word-embedding
technique. For rating prediction, [15] further combines the user and
target-item profiles with the latent factors from matrix factorization
as expressed in Equation 3.
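
The rating-weighted profiles of Equations 1 and 2 can be sketched as follows. This is a minimal illustration, not the implementation of [15]: the function name `build_profiles`, the toy embeddings and the choice to skip out-of-vocabulary words are assumptions made for this example.

```python
import numpy as np

def build_profiles(reviews, w2v, h):
    """Accumulate rating-weighted sums of word embeddings per user and item.

    reviews: iterable of (user, item, rating, tokens) tuples.
    w2v:     dict mapping a word to its embedding (NumPy array of size h).
    """
    user_profiles, item_profiles = {}, {}
    for user, item, rating, tokens in reviews:
        review_vec = np.zeros(h)
        for tok in tokens:
            if tok in w2v:  # assumption: words without an embedding are skipped
                review_vec += w2v[tok]
        weighted = rating * review_vec  # w2v(w_t) x pref_(i,j), summed over the review
        user_profiles[user] = user_profiles.get(user, np.zeros(h)) + weighted
        item_profiles[item] = item_profiles.get(item, np.zeros(h)) + weighted
    return user_profiles, item_profiles

# Toy data (hypothetical): two-dimensional embeddings and two reviews by one user.
toy_w2v = {"clean": np.array([1.0, 0.0]), "room": np.array([0.0, 1.0])}
toy_reviews = [("u1", "v1", 5, ["clean", "room"]), ("u1", "v2", 2, ["clean"])]
S_user, S_item = build_profiles(toy_reviews, toy_w2v, h=2)
```

Because every word contributes with the same weight inside a review, stopwords and opinion words are treated alike, which is the limitation discussed in the text that follows.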


$$pref_{(i,j)} \approx \alpha\, U_i V_j + (1 - \alpha)\,(U_{s,i} S_i + V_{s,j} S_j) \qquad (3)$$

Let $|latent|$ denote the number of latent dimensions. The model
then applies stochastic gradient descent to learn jointly, for user
$u_i$ and item $v_j$, the latent factors $U_i, V_j \in \mathbb{R}^{1 \times |latent|}$ and the
semantic latent factors $U_{s,i}, V_{s,j} \in \mathbb{R}^{1 \times h}$. The rating $pref_{(i,j)}$ is
then computed as a linear combination of the latent factors and the
semantic latent factors, with $\alpha$ being the parameter controlling the
trade-off.

Note that all previously mentioned methods exploit pretrained 
word embeddings of all words in reviews to represent the user and 
item profiles. We believe this is inefficient because some words 
have a more significant effect on user preferences than others. For 
example, stopwords such as “is,” “a” or “the” are likely to affect 
ratings less than opinion words (e.g., “great” or “not”) or context 
words (e.g., "service" or "friendly"). Moreover, although the word 
embeddings are pretrained to capture the semantic meaning of 
words, their relationship with the ratings is not captured. 

Many context-aware methods have been proposed that exploit 
review-based recommendations to make use of the rich and valu- 
able contextual information contained in reviews. For example, 
[6, 10, 12] applied supervised text-mining techniques to extract 
contexts from reviews. However, these methods adopt the represen- 
tational approach, which requires contexts to be predefined. Some 
methods have tried to extract review contexts using the interac- 
tional approach [5, 19]. They partitioned the reviews into contextual 
reviews (containing contexts) and non-contextual reviews. The con- 
texts were then extracted as those words or topics that occurred 
most often in the contextual reviews. Although their method has 
the advantage of not having to predefine the values of contexts, 
their extracted contexts are based on the "bag-of-words" approach, 
where contexts are in the form of unigrams. Because reviews are
written as free-form text, we should take this opportunity to explore
and utilize the contexts that could possibly be constructed
from more than just one word.

In this work, our contexts are represented by skip-grams of 
words that occupy the same text region. Their embedding repre- 
sentations are learned directly to capture their relationship with 
the ratings. Consequently, those words with more influence on the 
rating distributions will have more impact on the rating predictions. 


3 PROPOSED METHOD 


In this section, we provide a detailed explanation of our proposed
method. The workflow of our model is illustrated in Figure 1. The
method comprises two main parts, namely context extraction and
rating prediction. In the context extraction part, we first identify
the candidates for context words and extract all text regions
containing them, based on their global influence on the distributions
of review ratings. The text regions and their rating distributions
are then fed into a neural network to learn the word embeddings
and local context units, which are then used to compute the region
embeddings. In the rating prediction part, we model the local
influences of contexts on a review rating based on the past interactions
between their associated region embeddings with an individual
user and/or item. Based on these interactions, we propose four
predictive models for rating prediction, each of which is best suited
to a particular setting for the review data.

Figure 1: A workflow architecture of the proposed method.


3.1 Context Extraction 


We first present our context extraction method, which comprises
three main steps, namely identifying the candidate context words,
extracting the contextual regions and learning the region embeddings.


3.1.1 Identifying the Candidate Context Words. We derive the
definition from [18], where relevant contexts are defined as those
that contribute to explaining the variance in ratings. By applying
this definition to review data, a context can be any word in the
reviews that influences the distribution of ratings (i.e., its variance).
For example, suppose there are 100 reviews containing the word
“friendly” that use a 1-5 rating scale, with 60 having a rating of “5,”
30 having a rating of “4” and the remaining 10 having ratings of
“1,” “2” and “3.” This means that “friendly” implicitly influences the
distribution of the reviews’ ratings toward high ratings, and it can,
therefore, be considered a candidate for the set of relevant contexts.
Therefore, our first step in extracting the contextual information
from reviews is to identify the set of words that have influences
on the distributions of ratings over all reviews. We call this set of
words the candidate context words.

word rating ‘1’ rating ‘2’ rating ‘3’ rating ‘4’ rating ‘5’ variance
“clean” 17 59 226 532 756 100946.5
“great” 122304.5
“location” 30347.3
“friendly” 39736.3
“not” 81 13 109626.7

Figure 2: Example of word-rating co-occurrences and their
corresponding variances.

bigram rating ‘1’ rating ‘2’ rating ‘3’ rating ‘4’ rating ‘5’ variance
“very friendly” 12 30 23035.7
“not friendly” 143 35 4 8207.5

Figure 3: Example of rating co-occurrences with two bigrams
containing the word “friendly.”

To extract the candidate context words, we first create a word- 
rating co-occurrence matrix by counting the number of times each 
word in reviews occurs with each rating. Next, we calculate the 
variance from the frequency distribution of ratings for each word. 
Finally, only those words having a calculated variance above the
minimum variance threshold $min_{var}$ are selected as candidate context
words and are stored in the candidate list Cand. An example
of the word-rating co-occurrence matrix and the corresponding 
variances is shown in Figure 2. 
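
The three steps above (count, compute the variance, threshold) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function name, the toy reviews and the threshold value are assumptions. The sample variance (denominator n - 1) is used because it reproduces the value 100946.5 listed for “clean” in Figure 2.

```python
from collections import Counter, defaultdict
from statistics import variance  # sample variance (n - 1 denominator)

def candidate_context_words(reviews, min_var, ratings=(1, 2, 3, 4, 5)):
    """Return the candidate set Cand and the per-word rating variances.

    reviews: iterable of (tokens, rating) pairs.
    """
    cooc = defaultdict(Counter)          # word -> Counter(rating -> frequency)
    for tokens, rating in reviews:
        for tok in tokens:
            cooc[tok][rating] += 1
    variances = {w: variance([c[r] for r in ratings]) for w, c in cooc.items()}
    cand = {w for w, v in variances.items() if v > min_var}
    return cand, variances

# Row for "clean" taken from Figure 2.
clean_variance = variance([17, 59, 226, 532, 756])

# Toy corpus (hypothetical) with a hypothetical threshold of 0.5.
toy_reviews = [(["clean", "room"], 5), (["clean"], 5), (["clean"], 4), (["room"], 1)]
cand, variances = candidate_context_words(toy_reviews, min_var=0.5)
```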

In Figure 2, each cell contains a word frequency under each
rating value. For instance, “clean” occurred in reviews with rating
“3” 226 times. The distributions of ratings can be visualized by
the cell's grayscale shading, which represents the densities of the 
word frequencies from the highest (black) to the lowest (white). 
From this figure, we can observe, for example, that the word "clean" 
is distributed toward high ratings, while "location" is distributed 
toward low ratings. Note that, in addition to context words (e.g., 
"clean"), opinion or sentiment words such as “great” or “not” also 
have a great influence on the distributions of ratings. Therefore, we 
also include such words as candidate context words. 


3.1.2 Extracting Contextual Regions and Their Rating Distributions. 
Depending solely on the candidate context words might not be
sufficient to cover the variety of influences of contexts on the
distributions of reviews' ratings. This is because neighboring words that
are often written together with those candidate context words might
significantly alter the ways they influence the distributions of
ratings. For example, Figure 3 shows the co-occurrences
of two bigrams, “very friendly” and “not friendly.” Although they are
generated from the same candidate context word “friendly,” their
rating distributions are totally different (one is very positive and
the other is very negative). This is because those nearby words
might be opinion, sentiment or other words that could change the
semantic meanings of the candidate context words, and therefore
influence their rating distributions. For this reason, neighboring
words are crucial and should be incorporated with the candidate
context words to model the influences of contexts effectively.

Sitkrongwong and Takasu

Review: “The hotel is located in the city center which is very convenience for us. Also, the
room is comfortable and clean. And the staff never fail with their services. We
would definitely comeback here!”

Candidate context words: “great”, “staff”, “not”, “location”, “clean”, “very”, “friendly”, “comfortable”

Contextual regions (region size = 5):
which is very convenience for
room is comfortable and clean
comfortable and clean and the
and the staff never fail

Figure 4: Extracting the contextual regions from a review

Consider a candidate context word $c_n \in Cand$. We define the
nearby words of $c_n$ as any word $w_t$ that occupies the same text
region of $c_n$, i.e., $w_t \in region_{(c_n,d)}$, where $d$ is the window size for
a region of length $2 \times d + 1$. Note that $w_t$ can be in any position in
$region_{(c_n,d)}$, and it does not necessarily have to be directly adjacent
to $c_n$. This takes account of the different writing styles users may
adopt for the same meaning in writing reviews. For example, “the
staff are friendly” and “they have friendly staff” both indicate the
same context, “friendly staff.”

We, therefore, define a context as any word in a region of text
that has an influence on the distributions of ratings. Such words
can be in unigram, n-gram or even skip-gram form, provided they
occupy the same region. To identify the positions of these words,
we first need to extract their associated regions, namely contextual
regions. For each $c_n$, we extract all contextual regions of size $2 \times d + 1$
from all reviews containing $c_n$. Figure 4 shows an example of the
contextual regions of size 5 extracted from one review. Given that
this review contains four candidate context words, four contextual
regions are extracted.
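
The region extraction can be sketched as a sliding window centred on each candidate context word. The sketch below is an illustration under stated assumptions: the padding token `<pad>` and the function name are hypothetical, and the review of Figure 4 is tokenized by hand (lower-cased, punctuation removed, stopwords kept, as in the figure).

```python
PAD = "<pad>"  # hypothetical padding token for words near the review boundaries

def extract_regions(tokens, cand, d=2):
    """Return (candidate word, region) pairs, each region of size 2*d + 1."""
    padded = [PAD] * d + list(tokens) + [PAD] * d
    regions = []
    for idx, tok in enumerate(tokens):
        if tok in cand:
            center = idx + d  # position of the candidate word in the padded list
            regions.append((tok, padded[center - d:center + d + 1]))
    return regions

review = ("the hotel is located in the city center which is very convenience "
          "for us also the room is comfortable and clean and the staff never "
          "fail with their services we would definitely comeback here").split()
cand = {"great", "staff", "not", "location", "clean", "very", "friendly", "comfortable"}
regions = extract_regions(review, cand, d=2)
region_texts = [" ".join(words) for _, words in regions]
```

With d = 2, this yields the four regions listed in Figure 4, one per candidate context word that occurs in the review.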

After all contextual regions are extracted, we store them in the
list of contextual regions Region. For each contextual region with
the index $m$: $region^{(m)}_{(c_n,d)} \in Region$, we need to identify the positions
of the words that, along with $c_n$, contribute to the highest
variance in the rating distributions. This can be achieved in the
same way as identifying the candidate context words. Specifically,
for $region^{(m)}_{(c_n,d)}$, we first generate all skip-grams of size $p$ (where
$p \leq 2 \times d + 1$) that include $c_n$. We then count the number of times
each skip-gram occurs in the same region with each rating and
calculate the variance from the frequency distribution of ratings.

Figure 5 illustrates the process of identifying the skip-gram of
size $p = 2$ that contributes the highest variance for the region “room is
comfortable and clean.” If the skip-gram of $c_n$ and $w_t$ yields the highest
variance and is more than $min_{var}$, $w_t$ will be chosen as a part of
the context for the region, along with $c_n$. Finally, the rating distribution
of the skip-gram that yields the highest variance is selected as
the label $dist^{(m)}_{(c_n,d)}$ to represent the rating distribution of that region.

In Figure 5, the rating distribution of the skip-gram (“comfortable,”
“clean”) is selected as the label for the region “room is comfortable
and clean.” If no skip-gram exceeds the variance threshold $min_{var}$,
only the candidate context word $c_n$ is chosen as the context for
that region, and its rating distribution is chosen as the label. After
finding the set of $dist^{(m)}_{(c_n,d)}$ for all $region^{(m)}_{(c_n,d)} \in Region$, they are
stored in a rating-distribution matrix Dist.

skip-gram rating ‘1’ rating ‘2’ rating ‘3’ rating ‘4’ rating ‘5’ variance
(room, comfortable) 15 15 17 10
(is, comfortable) 8 7 9 7 10 2
(comfortable, and) 5 ? ? 10 8 4
(comfortable, clean) 8 15 22 23 23 44 (highest variance > $min_{var}$)

Train data: contextual region “room is comfortable and clean,” rating distribution 8 15 22 23 23

Figure 5: Identifying the skip-gram that contributes the
highest variance in rating distributions.
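
The labelling step can be sketched as follows: generate the skip-grams of size p that contain the candidate context word, then keep the one whose rating counts have the highest variance above the threshold. The function names are assumptions made for this sketch, the counts for the two skip-grams are the fully legible rows of Figure 5, and the fallback distribution `cn_counts` is a made-up placeholder.

```python
from itertools import combinations
from statistics import variance  # sample variance, as in candidate selection

def skipgrams(region, center, p=2):
    """All skip-grams of size p in the region that include the word at `center`."""
    others = [i for i in range(len(region)) if i != center]
    grams = []
    for combo in combinations(others, p - 1):
        idxs = sorted(combo + (center,))   # keep the original word order
        grams.append(tuple(region[i] for i in idxs))
    return grams

def select_label(cn, skipgram_counts, cn_counts, min_var):
    """Pick the skip-gram with the highest variance above min_var, else fall back to cn."""
    best, best_var = None, min_var
    for gram, counts in skipgram_counts.items():
        v = variance(counts)
        if v > best_var:
            best, best_var = gram, v
    if best is None:
        return (cn,), cn_counts
    return best, skipgram_counts[best]

region = ["room", "is", "comfortable", "and", "clean"]
grams = skipgrams(region, center=2, p=2)
counts = {("is", "comfortable"): [8, 7, 9, 7, 10],
          ("comfortable", "clean"): [8, 15, 22, 23, 23]}
cn_counts = [10, 12, 20, 30, 40]  # hypothetical distribution of "comfortable" alone
context, label = select_label("comfortable", counts, cn_counts, min_var=1)
```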


3.1.3 Learning the Region Embeddings for Contextual Regions. We
now have the contextual regions Region and their associated rating
distributions Dist. Our objective now is to build a predictive model
that, given a contextual region $region^{(m)}_{(c_n,d)}$ as an input, predicts a
rating distribution $dist^{(m)}_{(c_n,d)}$ as an output. To achieve this, we need
a model that can be used to identify which words in $region^{(m)}_{(c_n,d)}$
contribute to $dist^{(m)}_{(c_n,d)}$. We choose the region embedding with
local context unit proposed by [21] as our training model. This
model learns the region embeddings, i.e., the representations of the
text regions, with the help of word embeddings and local context
units. The local context unit is a weighting matrix that captures
the interactions between a word and its neighbors in a text region.
Our interest here is that the local context unit can be applied to
emphasize the positions of the words that have an influence on the
rating distributions, and therefore can be considered as part of the
context.

Formally, every word $w_t$ has an associated word embedding
$e_{w_t}$, which is stored in a column of the embedding matrix $E \in
\mathbb{R}^{h \times |Vocab|}$, where $h$ is the embedding size and Vocab is the
vocabulary of all words in the training data. In addition to the word
embeddings, a candidate context word $c_n$ also has its associated
local context unit matrix $K_{c_n} \in \mathbb{R}^{h \times (2 \times d + 1)}$, which is stored in the
tensor $C \in \mathbb{R}^{h \times (2 \times d + 1) \times |Cand|}$. Note that we only need to learn
$K_{c_n}$ for $|Cand|$ candidate context words, unlike the original model
that learned this parameter for all $|Vocab|$ words.

Given a contextual region $region^{(m)}_{(c_n,d)}$ as an input, the projected
word embedding $p_{w_t}$ of word $w_t$ at index position $l$ of $region^{(m)}_{(c_n,d)}$
is calculated by Equation 4.

$$p_{w_t} = K_{c_n,l} \odot e_{w_t} \qquad (4)$$

The word embedding $e_{w_t}$ of word $w_t$ at position $l$ of $region^{(m)}_{(c_n,d)}$
is projected into the point of view of the candidate context word $c_n$
by element-wise multiplying with the corresponding column $l$ of
$K_{c_n}$. This indicates that $c_n$ can alter the semantic meaning of $e_{w_t}$.
For example, $w_t$ = “clean” in the region of $c_n$ = “comfortable” has
a different semantic meaning from when it is in the region of $c_n$ =
“not.”

After obtaining all projected word embeddings, the region
embedding $r^{(m)}_{(c_n,d)} \in \mathbb{R}^{h \times 1}$ of the contextual region $region^{(m)}_{(c_n,d)}$
is computed by applying the max-pooling operation over all projected
word embeddings, as given in Equation 5.

$$r^{(m)}_{(c_n,d)} = \max([p_{w_{t-d}}, p_{w_{t-d+1}}, \ldots, p_{c_n}, \ldots, p_{w_{t+d-1}}, p_{w_{t+d}}]) \qquad (5)$$

Finally, $r^{(m)}_{(c_n,d)}$ is fed into the fully connected layer to calculate
the rating distribution $dist^{(m)}_{(c_n,d)}$. Although the model was originally
proposed for the classification task, we want to predict a
rating-distribution vector. We, therefore, adopt a multivariate
linear-regression model. Our model can be expressed as in Equation 6.

$$dist^{(m)}_{(c_n,d)} \approx f(x; E, C, W, b) = W \cdot r^{(m)}_{(c_n,d)} + b \qquad (6)$$

Here, the parameter $x$ is an input, i.e., the contextual region $region^{(m)}_{(c_n,d)}$;
$W \in \mathbb{R}^{|rating| \times h}$ and $b \in \mathbb{R}^{|rating| \times 1}$ are the weight matrix and
bias vector, respectively, where $|rating|$ is the size of the categorical
rating scores. We chose L2 as our loss function, following [21], and
Adam [11] as the optimizer. No regularization is applied.
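
The forward pass of Equations 4 to 6 can be sketched in NumPy as follows. This is an illustration of the computation only, not the training code of [21]: the function names are assumptions, the parameters are random or hand-picked rather than learned, and the bias is stored as a one-dimensional array for convenience.

```python
import numpy as np

def region_embedding(E, K_cn, word_ids):
    """Equations 4 and 5 for one contextual region.

    E:        (h, |Vocab|) word-embedding matrix.
    K_cn:     (h, 2d+1) local context unit of the candidate context word.
    word_ids: vocabulary indexes of the 2d+1 words in the region, in order.
    """
    P = K_cn * E[:, word_ids]   # column l of K_cn times the embedding at position l
    return P.max(axis=1)        # element-wise max-pooling over the positions

def predict_distribution(r, W, b):
    """Equation 6: linear map from the region embedding to a rating distribution."""
    return W @ r + b

rng = np.random.default_rng(0)
h, vocab_size, d, n_ratings = 4, 10, 2, 5
E = rng.normal(size=(h, vocab_size))
K_cn = rng.normal(size=(h, 2 * d + 1))
W = rng.normal(size=(n_ratings, h))
b = np.zeros(n_ratings)
word_ids = [3, 7, 1, 0, 9]                      # hypothetical region of five words
r = region_embedding(E, K_cn, word_ids)
dist_hat = predict_distribution(r, W, b)

# Hand-checkable case: with all-ones embeddings, r is the row-wise max of K_cn.
r_small = region_embedding(np.ones((2, 3)),
                           np.array([[1.0, 5.0, 2.0], [0.0, -1.0, -3.0]]),
                           [0, 1, 2])
```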

After all model parameters ($E$, $C$, $W$ and $b$) are learned, each
contextual region $region^{(m)}_{(c_n,d)}$ can now be mapped with its region
embedding representation $r^{(m)}_{(c_n,d)}$. Such a region embedding is trained
to capture the global influence, i.e., the influence on the rating
distribution of the entire review dataset, of its associated contextual
region. This means that if two region embeddings are similar in
the embedding space, they will be expected to contribute similar
rating distributions.

In the next section, we show how the extracted contextual
regions and their associated region embeddings can be utilized in the
rating prediction task.


3.2 Rating Prediction 


The previous section explained how we are able to extract context 
from reviews as contextual regions and represent them by their 
associated region embeddings. These region embeddings, however, 
capture only the global influence of the contextual regions on the 
rating distributions for the entire review dataset. 

To make a personalized rating prediction, it is important to model 
the local influence of each contextual region on a particular review 
provided by an individual user with respect to a specific item. This 
is because a contextual region might have a range of possible in- 
fluences on the user's decision about the item that depends on its 
relevance to the user's personal preferences and the item's unique 
features. For example, a user might choose a hotel based on the
cleanliness of the room but ignore the location of the hotel, meaning
that contextual regions containing the word “clean” should have
more influence on the ratings for this user than those that contain
“location.” Similarly, a hotel might be famous for its location,
which should have more influence on the ratings it receives than
its breakfast menu.

To model the local influences of contextual regions, we introduce
$T_u \in \mathbb{R}^{|Users| \times h}$, the user-context interaction matrix, and $T_v \in
\mathbb{R}^{|Items| \times h}$, the item-context interaction matrix. The parameter $h$ is
the embedding size, which is the same as for the region embedding.
The rows $t_{u_i} \in T_u$ and $t_{v_j} \in T_v$ represent user $u_i$'s and item
$v_j$’s interaction vectors, respectively, with the contextual regions.
These vectors are learned to capture the past interaction of each
user/item with the contextual regions and model the influences of
such regions on the ratings for that user/item. For example, if $u_i$
wrote reviews containing the contextual region “room is comfortable
and clean” with rating “5” a significant number of times, $t_{u_i}$ will be
able to justify this contextual region's relevance to $u_i$’s preferences,
because it influences the ratings given by $u_i$.

The vectors $t_{u_i}$ and $t_{v_j}$ can be seen as projection vectors for
converting the region embedding $r^{(m)}_{(c_n,d)}$ (which captures the global
influence on the rating distribution of the entire review data) to
the region embeddings $r^{(m)}_{(c_n,d),i}$ and $r^{(m)}_{(c_n,d),j}$ (which capture the
local influences on the ratings of $u_i$ and $v_j$, respectively). We call
$r^{(m)}_{(c_n,d),i}$ the user-relevance region embedding, and $r^{(m)}_{(c_n,d),j}$ the
item-relevance region embedding. Specifically, $r^{(m)}_{(c_n,d),i}$ and $r^{(m)}_{(c_n,d),j}$ are
computed from the interactions between $r^{(m)}_{(c_n,d)}$ and $t_{u_i}$ or $t_{v_j}$ using
element-wise multiplication, as expressed in Equation 7.

$$r^{(m)}_{(c_n,d),i} = t_{u_i} \odot r^{(m)}_{(c_n,d)}, \qquad r^{(m)}_{(c_n,d),j} = t_{v_j} \odot r^{(m)}_{(c_n,d)} \qquad (7)$$

After the region embeddings for all contextual regions in the
review $q_{(i,j)} \in Review$ are converted into user-relevance and
item-relevance region embeddings, they are ready to be used to predict
the rating of user $u_i$ toward item $v_j$. Our predictive model is
represented by Equation 8.


$$pref_{(i,j)} = g(x'; T_u, T_v, W', b') = W' \cdot y_{q_{(i,j)}} + b' \qquad (8)$$

This model can be considered as a neural network that uses
simple linear regression to predict the rating output $pref_{(i,j)}$, given
a review $q_{(i,j)}$ as an input $x'$. The model utilizes the region
embeddings of all contextual regions extracted from $q_{(i,j)}$ to compute
$r^{(m)}_{(c_n,d),i}$ and $r^{(m)}_{(c_n,d),j}$. The parameter $y_{q_{(i,j)}} \in \mathbb{R}^{h \times 1}$ is the
representation of the review $q_{(i,j)}$ in a fully connected layer, which can
be computed in various ways depending on the predictive model
used. The parameter $W' \in \mathbb{R}^{1 \times h}$ is a weight vector and $b' \in \mathbb{R}$ is
a scalar bias. Similarly to the context extraction model case, we
again choose L2 as our loss function and Adam as the optimizer.
No regularization is applied.

We propose four different predictive models based on how the
user-relevance and item-relevance region embeddings are used.
These models are the no relevance (NR), user relevance (UR), item
relevance (IR) and user-item relevance (UIR) models. They are
designed to deal with various cold-start and sparse data situations.


3.2.1 NR model. To show the importance of considering the
relevance of context to each user and each item, we first propose a
predictive model that ignores the relevance of contexts. In the NR
model, the region embeddings of the contextual regions are used
to compute $y_{q_{(i,j)}}$ directly without any conversion, as shown by
Equation 9.

$$y_{q_{(i,j)}} = \sum_{m \in M_{q_{(i,j)}}} r^{(m)}_{(c_n,d)} / N \qquad (9)$$

Here, $M_{q_{(i,j)}}$ is the set of indexes of $region^{(m)}_{(c_n,d)}$ in Region, for all
$N$ contextual regions extracted from the review $q_{(i,j)}$. The
representation of $q_{(i,j)}$ is computed by averaging all corresponding
region embeddings of the contextual regions in $q_{(i,j)}$. Therefore,
the predicted rating for $q_{(i,j)}$ does not depend on any interaction
with either the user or the item. This means that if two reviews
contain exactly the same set of contextual regions, they will receive
the same rating. The NR model is suitable for very sparse datasets,
where there are few reviews for most users and for most items.

Figure 6: Illustrations of three of the four proposed models:
(a) UR model, (b) IR model and (c) UIR model.


3.2.2 UR model. A graphical representation of the UR model is
given in Figure 6 (a). In this model, only the user-relevance region
embeddings are used to compute $y_{q_{(i,j)}}$, as given in Equation 10.

$$y_{q_{(i,j)}} = \sum_{m \in M_{q_{(i,j)}}} r^{(m)}_{(c_n,d),i} / N \qquad (10)$$

The UR model predicts the rating for review $q_{(i,j)}$ considering
only the relevance of context to user $u_i$, with the relevance of
context to the item $v_j$ being ignored. The idea behind this model
is that the rating a user will give to any item depends only on
the suitability of the contextual information to that user's past
preferences. With this model, any item with the same set of contexts
will receive the same rating from the same user. The UR model is
designed for datasets where a cold start is a problem for the items,
i.e., most items have only a few reviews.


3.2.3 IR model. Figure 6 (b) represents the IR model. In contrast to
the UR model, this model computes $y_{q_{(i,j)}}$ by considering only the
item-relevance region embeddings, as expressed by Equation 11.

$$y_{q_{(i,j)}} = \sum_{m \in M_{q_{(i,j)}}} r^{(m)}_{(c_n,d),j} / N \qquad (11)$$

In contrast to the UR model, the idea behind the IR model is
that the rating an item will receive from any user only depends on
the suitability of the contextual information to its unique features.
This means that any user who chooses the same item with the
same set of contexts should generate the same rating. The model
is appropriate for datasets where a cold start is a problem for the
users, i.e., most users have only a few reviews.


3.2.4 UIR model. The fourth and final model, the UIR model, is
slightly more complex than the other three. As shown in Figure
6 (c), the region embeddings $r^{(m)}_{(c_n,d)}$ are projected to both the user
interaction vector $t_{u_i}$ and the item interaction vector $t_{v_j}$. Therefore,
this model employs both the user-relevance region embedding
$r^{(m)}_{(c_n,d),i}$ and the item-relevance region embedding $r^{(m)}_{(c_n,d),j}$. We
apply the max-pooling operation on these two region embeddings
to create the user-item relevance region embedding $r^{(m)}_{(c_n,d),(i,j)}$, as
expressed by Equation 12.

$$r^{(m)}_{(c_n,d),(i,j)} = \max(r^{(m)}_{(c_n,d),i}, r^{(m)}_{(c_n,d),j}) \qquad (12)$$

The purpose of this combined region embedding is to capture
the maximum relevance of $region^{(m)}_{(c_n,d)}$ to the pair of $u_i$ and $v_j$.
For example, this means that, even if a contextual region is not
highly relevant to an item's features but it is highly relevant to a
user's preferences, it can still affect the rating. For this model, the
parameter $y_{q_{(i,j)}}$ is computed by Equation 13.

$$y_{q_{(i,j)}} = \sum_{m \in M_{q_{(i,j)}}} r^{(m)}_{(c_n,d),(i,j)} / N \qquad (13)$$

Compared with the other three models, this model is more
realistic in that contexts should be relevant to both users’ preferences
and items’ features, and should, therefore, influence the review
ratings. In other words, the rating that the review will receive depends
on the suitability of the contextual information to both the user’s
preferences and the item’s features. The UIR model is suitable for
datasets for which there are significant numbers of reviews for both
users and items.
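
The four review representations (Equations 9 to 13, built on the projections of Equation 7) and the linear prediction of Equation 8 can be sketched together. The function names, the `model` switch and the toy values are assumptions made for this illustration; the weight vector is stored as a one-dimensional array.

```python
import numpy as np

def review_representation(R, t_u=None, t_v=None, model="NR"):
    """Average the (possibly projected) region embeddings of one review.

    R:   (N, h) region embeddings of the N contextual regions in the review.
    t_u: (h,) interaction vector of the user; t_v: (h,) of the item.
    """
    if model == "NR":
        Z = R                                  # Equation 9: no projection
    elif model == "UR":
        Z = R * t_u                            # Equations 7 and 10
    elif model == "IR":
        Z = R * t_v                            # Equations 7 and 11
    elif model == "UIR":
        Z = np.maximum(R * t_u, R * t_v)       # Equations 12 and 13
    else:
        raise ValueError(f"unknown model: {model}")
    return Z.mean(axis=0)

def predict_rating(y, W, b):
    """Equation 8: linear regression on the review representation."""
    return float(W @ y + b)

# Toy values (hypothetical): two regions with embedding size h = 2.
R = np.array([[1.0, 2.0], [3.0, 4.0]])
t_u = np.array([2.0, 0.0])
t_v = np.array([1.0, 1.0])
y_nr = review_representation(R, model="NR")
y_ur = review_representation(R, t_u=t_u, model="UR")
y_ir = review_representation(R, t_v=t_v, model="IR")
y_uir = review_representation(R, t_u=t_u, t_v=t_v, model="UIR")
rating_uir = predict_rating(y_uir, W=np.array([1.0, 1.0]), b=0.5)
```

In the toy case, the second embedding dimension is irrelevant to the user (weight 0) but relevant to the item, and the UIR representation keeps it through the element-wise maximum.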


4 EXPERIMENTS AND DISCUSSION 


4.1 Data Preparation 


We used the publicly available TripAdvisor dataset¹, which
contained 878,561 hotel reviews when used for our experiments. We
preprocessed the data by tokenizing the reviews and converting
all words to lower case. All punctuation marks and infrequently
used words (i.e., those of appearance frequency below 0.01% in all
reviews) were removed. We also removed all stopwords listed by
NLTK², except for those that indicate sentiment meanings (e.g.,
“very” or “not”). After the preprocessing, reviews of fewer than two
words were marked as uninformative and were therefore discarded.
The preprocessed-data statistics are given in Table 1. Because the
extraction and prediction parts require different types of input,
further specific data preparation for each part was necessary.

¹ http://www.cs.cmu.edu/~jiweil/html/hotel-review.html
² https://www.nltk.org/nltk_data/

Table 1: Statistics for the Preprocessed TripAdvisor Dataset

Reviews            873,199
Words per review   Min: 2   Max: 2,011   Average: 83.008
Users              575,264
Reviews per user   Min: 1   Max: 63      Average: 1.385
Items              3,941
Reviews per item   Min: 1   Max: 5,426   Average: 221.568
Ratings            1: 53,501   2: 59,711   3: 121,780   4: 291,913   5: 346,294
% Ratings          1: 6.1%     2: 6.8%     3: 13.9%     4: 33.4%     5: 39.7%

(a)
word rating ‘1’ rating ‘2’ rating ‘3’ rating ‘4’ rating ‘5’
“great” 5482 11513 32002 105846 147186
“good” 9051 40018 91115
“not” 32373 35551 60391 103517
“clean” 95295 99046
“night” 17084 18622 30851 57660

(b)
word rating ‘1’ rating ‘2’ rating ‘3’ rating ‘4’ rating ‘5’
“great” 5.85 11.01 16.38 24.32 29.41
“good” 9.78
“not”
“clean” 7.32 19.74
“night” 18.63 17.93 11.42

Figure 7: Rating distributions for example words: (a) before
standardization, (b) after standardization.


4.1.1 Data for Context Extraction. We randomly chose 90% of the
dataset to train the model parameters for the context extraction.
Here, the main task was to extract the contextual regions and their
associated rating distributions. The first step was to identify the
candidate context words that influence the rating distributions.
However, the main problem was that some rating datasets
contained a bias in the proportion of ratings provided by the users.
For example, as given in Table 1, more than 70% of the reviews are
rated as “4” or “5.” Therefore, almost every word in the corpus was
distributed toward high-rating scores, as shown in Figure 7 (a). To
analyze the influence of a word on the rating distribution properly,
we applied a data standardization technique, as
expressed by Equation 14.

$$x^{new}_{t,r} = \frac{x_{t,r} - \mu_r}{\sigma_r} \qquad (14)$$

Here, $x_{t,r}$ is the frequency of word $w_t$ given for rating $r$, $\mu_r$ is the
average of the frequencies of all words given for rating $r$ and $\sigma_r$
is the standard deviation of the frequencies of all words given for
rating $r$. The rating distributions after applying this standardization
are shown in Figure 7 (b). For example, the word “not” is now more
distributed toward low ratings, which is more appropriate, given
that it indicates a negative meaning.

Table 2: Model Parameters for All Evaluated Methods

Method    Rate/User  Rate/Item  Emb. Size  Latent  Learn. Rate
WordEmb   1          1          300        -       0.001
NMF       5          5          100        10      0.001
NMFw2v    5          5          100        10      0.001
NR        1          1          300        -       0.001
UR        10         1          300        -       0.0001
IR        1          200        300        -       0.001
UIR       5          5          300        -       0.0001
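
The standardization of Equation 14 amounts to a per-rating z-score over the word-rating co-occurrence matrix. The sketch below assumes the population standard deviation (`ddof=0`), which the text does not specify, and assumes that no rating column is constant; the function name and the toy matrix are illustrative.

```python
import numpy as np

def standardize(counts):
    """Apply Equation 14 column by column.

    counts: (|Vocab|, |rating|) matrix of word frequencies per rating.
    """
    mu = counts.mean(axis=0)      # average frequency of all words for each rating
    sigma = counts.std(axis=0)    # assumption: population standard deviation
    return (counts - mu) / sigma

# Toy matrix (hypothetical): two words, two rating values.
toy_counts = np.array([[1.0, 10.0], [3.0, 30.0]])
toy_standardized = standardize(toy_counts)
```

After this step, each rating column has zero mean and unit standard deviation, so a word's variance across ratings no longer reflects the overall skew toward ratings “4” and “5.”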


After standardization, applying $min_{var} = 1$ gave us a set of
226 candidate context words Cand. We set a region size of 5 for
the extraction of contextual regions from reviews. Because each
$c_n \in Cand$ might be the first or last word in the review, we first
added a padding of length $d = 2$ to the head and tail of each review.
The result of the extraction gave us 25,077,762 contextual regions.
To avoid scalability problems in the training process, we randomly
sampled only a subset of the contextual regions for the training.
Specifically, let $Region_{c_n}$ denote the set of contextual regions of $c_n$.
If $|Region_{c_n}| > 100k$, only a 10% subset is used for training. If $10k
< |Region_{c_n}| < 100k$, 10k are used. If $|Region_{c_n}| < 10k$, all are used.
After this process, 3,125,212 contextual regions were selected as
the training regions. We generated skip-grams of size $p = 2$ and
assigned the rating distribution with the highest variance to each
region, using $min_{var} = 1$.
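
The per-word sampling rule can be sketched as follows. The function name and the integer rounding of the 10% subset are assumptions; the boundary cases of exactly 10k and 100k regions, which the text leaves open, give the same sample size under either neighboring rule.

```python
import random

def sample_regions(regions, rng=random):
    """Subsample the contextual regions of one candidate context word."""
    n = len(regions)
    if n > 100_000:
        k = n // 10            # 10% subset (integer rounding is an assumption)
    elif n >= 10_000:
        k = 10_000             # fixed-size sample for mid-sized sets
    else:
        return list(regions)   # small sets are used in full
    return rng.sample(regions, k)

rng = random.Random(0)
large = sample_regions(list(range(200_000)), rng)
medium = sample_regions(list(range(50_000)), rng)
small = sample_regions(list(range(5_000)), rng)
```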


4.1.2 Data for Rating Prediction. We evaluated the predictive per- 
formance of the four models using a fivefold cross-validation tech- 
nique. The model parameters for our predictive models and for the 
other methods evaluated are presented in Table 2. We compared our 
proposed models with three other methods, namely word embed- 
ding for regression (WordEmb), nonnegative matrix factorization 
(NMF) and NMFw2v. WordEmb is the method whereby word em- 
beddings are learned to predict the rating for each review (i.e., we 
compute the average of all word embeddings in a review, and feed 
it to the fully connected layer for prediction). NMFw2v [18] is a 
version of NMF, extended to incorporate the user and item em- 
bedding profiles, as expressed by Equation 2. We followed [18] to 
set values for the latent dimension and embedding sizes and used 
Gensim Word2Vec³ to learn the word embeddings for all words in
our corpus. 

To train the NMF, NMFw2v, UR, IR and UIR models, we selected 
only users and/or items with a significant number of reviews, to 
ensure that these models have sufficient data to learn the embedding 
profiles and/or the interactions with contexts. For example, only 
reviews from users who had provided more than 10 reviews were 
used to train our UR model, thereby producing high-quality user- 
context interaction vectors. For our predictive models, all reviews 
with no candidate context words were discarded, which resulted in 
different sizes of training and test datasets for each model. Finally, 
we trained all models using the Adam optimizer, with L2 as the loss
function. The regularization parameters for NMF and NMFw2v
were set to 0.1 and 0.001, respectively. 


³ https://code.google.com/archive/p/word2vec/

Figure 8: Visualization of the local context units for some
chosen candidate context words.


4.2 Results and Discussion 


In this section, we first visualize and discuss the extracted candidate 
context words, together with the influence of their neighboring 
words on the rating distributions. We then present the predictive 
performances of our predictive models compared with the other 
methods and discuss these results. 


4.2.1 Influences of Candidate Context Words and Their Neighboring
Words. As discussed in Section 3.1.2, neighboring words might
alter the influence of candidate context words on the rating
distributions. To analyze such influences, we follow [21] by applying
the L2-norm to each column of the local context unit. This enables
the influence levels of the candidate context words and their neighboring
words to be emphasized, as shown in Figure 8. For example, words
that follow “staff” and “very” have more influence on rating
distributions than the words that come before them. This corresponds to
the following words often being “good,” “helpful” or “friendly” for
“staff” and “clean,” “convenient” or “comfortable” for “very.” On the
other hand, words such as “breakfast” are less influenced by neigh- 
boring words, meaning that the word itself sufficiently describes 
the rating distributions without any help from neighboring words. 
Moreover, the local context units can differentiate the influence of
positive words such as “good” or “excellent.” Although the rating
distributions of “good” are influenced by its neighboring words, those
of “excellent” are not. This is because the word “excellent” itself
indicates the strongest positive meaning, whereas the semantic 
meaning of “good” can be altered if it follows words such as “not” 
or “very.” 

For these reasons, we see that the local context units can capture 
the influences of the candidate context words efficiently, together 
with their neighboring words, on the rating distributions. This 
further helps to produce high-quality region embeddings, which 
are capable of semantically representing the distribution of ratings 
for the individual contextual regions. 
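
The column-wise norm used for Figure 8 can be sketched as follows. This is a minimal illustration, assuming that a local context unit is stored as a small matrix whose rows are embedding dimensions and whose columns are word positions around the candidate context word; the function name column_l2_norms and the numbers in the example are hypothetical and are not taken from our experiments.

```python
import math


def column_l2_norms(unit):
    """Return the L2-norm of each column of a row-major matrix."""
    n_cols = len(unit[0])
    return [math.sqrt(sum(row[j] ** 2 for row in unit)) for j in range(n_cols)]


# Hypothetical 2 x 3 local context unit (illustrative values only).
# Columns: previous word, candidate context word, following word.
unit = [
    [0.0, 3.0, 0.6],
    [0.0, 4.0, 0.8],
]

# One influence level per position; a larger norm means more influence.
influence = column_l2_norms(unit)
```

In this toy unit the middle column has the largest norm, which would correspond to a word such as “breakfast” that describes the rating distribution largely by itself.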


4.2.2 Predictive Performances. The prediction accuracy (MAE and
RMSE) of our predictive models, compared with the other methods,
is presented in Table 3. From the table, our models provide the best
prediction accuracy, followed by WordEmb, NMFw2v and NMF.
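
For reference, the two accuracy measures can be computed as in the following minimal sketch. It uses the standard definitions of mean absolute error and root mean squared error; the ratings in the example are hypothetical and unrelated to the datasets used in our experiments.

```python
import math


def mae(actual, predicted):
    """Mean absolute error between two equally long rating lists."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error between two equally long rating lists."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical ratings on a 1 to 5 scale (illustrative values only).
actual = [5, 3, 4, 1]
predicted = [4, 3, 2, 1]

mae_value = mae(actual, predicted)
rmse_value = rmse(actual, predicted)
```

Lower values are better for both measures; RMSE penalizes large individual errors more heavily than MAE does.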

As compared with WordEmb, which exploits embeddings of
all words to predict a rating, our NR model yields a similar pre- 
diction accuracy. This means that exploiting a smaller number of 
embeddings from the contextual regions is sufficient to produce an 
effective predictive model. The accuracy, however, can be further 
improved by considering the relevance of contexts to user and/or 
item, as shown by the results from the UR, IR and UIR models. 

As compared with NMF, all other methods that incorporate the
review data provide better results. This means that the textual
content helps improve the prediction accuracy over the model that
considers only the preference data. However, the NMFw2v model,
which exploits the pretrained word embeddings to create user and
item profiles, produces no significant difference in accuracy
compared with the baseline NMF. In our experiment, we obtained
the best prediction accuracy for NMFw2v with α = 0.95. This
means that NMF contributes most of the prediction, not the
user or item embedding profiles. If we compare this method with
WordEmb, which learns word embeddings directly for the predic- 
tion task, NMFw2v yields a lower accuracy. In our case, we might 
assume that learning the word embeddings directly for prediction 
is more efficient than exploiting the pretrained word embeddings. 

Now, let us compare the predictive performances of our
four predictive models. First, the UR, IR and UIR models outperform
the NR model, meaning that considering the relevance of contexts
to users and/or items improves the accuracy of the model. However,
because the NR model does not depend on the past interactions
of users or items with contexts, it can make a prediction
even if a user or item has no review data. Similarly, the UR model
does not require any item’s interaction with contexts, while the IR
model does not require any user’s interaction with contexts. These
three models are suitable for dealing with different cold-start
scenarios, where there are new users and/or items with no past
interaction with the system. The UIR model provides the best overall
accuracy of all the models because it considers the relevance of
contexts to both users and items, and models their local influences on
the rating. However, its accuracy comes at the cost of prediction
coverage because it requires a significant number of reviews from
both users and items to model their interactions with the contextual
regions. We, therefore, conclude that our four predictive models are
suitable for different situations, depending on the characteristics of
the review data. If the review data are dense, that is, both users and
items have a significant number of reviews, the UIR model is
recommended. If the data have user or item cold-start problems, then
UR or IR is preferable. Finally, though less accurate, the NR model is
more robust to sparse data because it does not require any
interaction from users or items.
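
The recommendation above can be summarized as a simple selection rule. The following is only a sketch: the function name choose_model is hypothetical, and the threshold of 10 reviews is an assumption borrowed from the filter used to train the UR model, applied here to items as well for illustration.

```python
def choose_model(user_reviews, item_reviews, min_reviews=10):
    """Pick a predictive model from the review counts of a user and an item.

    min_reviews is an assumed threshold for "a significant number of
    reviews"; it is illustrative and should be tuned to the dataset.
    """
    user_ok = user_reviews > min_reviews
    item_ok = item_reviews > min_reviews
    if user_ok and item_ok:
        return "UIR"  # dense data: user and item interactions are both modeled
    if user_ok:
        return "UR"   # item cold start: only the user's interactions are needed
    if item_ok:
        return "IR"   # user cold start: only the item's interactions are needed
    return "NR"       # sparse data: no past interaction is required


# Hypothetical (user review count, item review count) pairs.
choices = [choose_model(u, i) for u, i in [(25, 40), (25, 0), (0, 40), (0, 0)]]
```

The fallback order mirrors the trade-off discussed above: each step down gives up some accuracy in exchange for wider prediction coverage.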


Table 3: Prediction Accuracy Results

Method     MAE      RMSE
WordEmb    0.8962   1.1653
NMF        1.8530   1.6478
NMFw2v     1.1377   1.8809
NR         0.9038   1.1661
UR         0.7623   0.9806
IR         0.8371   1.1169
UIR        0.7400   0.9647


5 CONCLUSION


We propose a novel unsupervised context extraction method for
review data, along with predictive models for rating prediction.
Because our method automatically extracts contexts from reviews,
there is no need to predefine the optimal values of contexts. Unlike
any previous context-aware method, our context is defined in the
form of a skip-gram, meaning that it can be constructed from any
word in the same text region. This makes our method suitable for
extracting contexts from reviews, where users have different writing
styles to indicate their contextual information. Moreover, we
consider the influences of contexts at both the population and
individual levels. While the influences of contexts on the ratings of
all reviews are captured by their region embedding representations,
the rating prediction is made by considering the relevance of those
contexts to the individual users and items. The four predictive models
make our proposed method suitable for dealing with different
cold-start and sparse data situations. Experimental results show that
our models yield better prediction accuracy than the review-based
recommendations that do not consider contexts.


ABSTRACT


This paper raises the privacy issues related to information that
is accessible about individuals from their mobile devices and that
which is collected when they interact with and use so-called "free"
services provided on the web. The importance of privacy has been
ignored by most legislation, and any laws passed have no teeth.
The only exception is the privacy protection that is embedded in
the EU's General Data Protection Regulation (GDPR). GDPR gives
control to individuals over their personal data and requires any
organization which collects and controls personal information to
have in place appropriate measures, both technical and logistic, to
implement the data protection principles. In this paper, we propose
a technical solution to provide a personal email and web server
with complete control of all correspondence and contents. This
would liberate users from fake free services and provide privacy
and security.


1 PRIVACY: ITS RISE AND FALL


According to some, privacy was the result of the growth of the
middle class who could afford better housing and whose abode, even
though humble, became their castle. Liberalization of laws, such as
the one that took the state out of the bedrooms, recognized the right
of people to be left alone. However, this was only at the governmental
level, since the government had supreme power, while the corporations
were left to regulate themselves. At the end of the first gilded age,
the need was felt to regulate the robber barons and their corporations.
This was done through the emergence of the workers' unions and the need
to share the riches with the worker. The latter was implemented with a
progressive tax system that supported the common good by taxing
the haves to transfer to the have-nots. This afforded the luxury
of privacy to the not so rich! It must be noted that recently, the
Supreme Court of the largest functioning democracy in the history
of humankind has ruled that privacy is a human right [44]; this in
spite of an inane pronouncement of a kid who became fabulously
rich and influential [53].

The tools and technology that are the roots of the privacy
exploitation required many developments, quite a number of which
occurred in the mid-20th century. The first of these was the
introduction of computers and the development of semiconductors, their
miniaturization and the increase in computing power. The
interconnection of computers and hence people was made possible with
the introduction of the Internet. In the early days, access to the 
internet was limited to the academic communities, some businesses 
and government agencies. The introduction of the web and the 
development of the graphical browsers opened up the internet to 
an increasing number of users. 

The offer of web-based free email service by start-ups fueled by
venture capitalists gave these companies continuous access to
all email communications of an increasing number of users
worldwide! The undeclared charge for this free service was, and
continues to be, the contents of the users' messages and the users
themselves, who in turn attracted more users and more contents. There
were no laws to protect the privacy of the contents of these messages,
which are in plain text. This clear text was the raw material used
initially for monetizing the ‘free’ service, by tailoring targeted
publicity to its users.

The graphical web browser also opened up a new method of
communicating with family and friends. The majority of users were
seduced by the allure of setting up a free web presence without the
need to be tech-savvy; this was the start of what we now call online
social networks (OSN). The OSN attracted not only friends and family
but complete strangers. The attraction of becoming a celebrity has been
the force that pushed up the number of such users to billions. To this
date, instead of recognizing themselves as publishers, these OSN claim
to be simple platforms and take no responsibility for the content they
provide, which has created many problems, including bullying, extreme
self-harm, fake news, genocide and the co-ordination of terrorist
acts [95].

The web opened up the internet to the non-tech-savvy user. Instead
of creating tools so that Internet services such as email could be
used privately, securely and easily by the non-tech-savvy, or offering
them via the national postal service, the unimaginative politicians let
them be provided by venture capitalists.

Creating a system such as the web robot, and allowing any robot to be
served freely by the web servers, was one of the biggest blunders
made by the web community. The taking over of the web by the
commercial players and the facilitation of tracking in the web browser
are other blunders, driven by the monetization of the web. It is
reported that there is an attempt to undo the harm in the new web but,
again, some of the players are the same as in the first web, and the
same commercial pressure would jeopardize this solution.

The cell phone, which started out as a bulky device, profited from
the miniaturization of semiconductor circuits and started getting
smaller, lighter and more popular. The addition of a screen, more
computing power and better and better networks transformed the 
cell phone from an emergency device to a personal communication 
device. The tech giant companies made sure that the new ‘smart’ 
cell phones have access to their services and the cell phone replaced 
the personal computer and the land line. The cell phone system 
also allowed bypassing the need of setting up an extensive telecom- 
munication infrastructure and the need for laying and maintaining 
cabling and relay stations. This was replaced by cell phone towers 
and relay stations. The mobile system sped up communication
in emerging economies as well as the developed ones.

With the growth of the internet and the mobile communication
infrastructure, gathering data on individuals from their
communications and interactions became relatively easy.
The setting up of the Global Positioning System (GPS) in the early
seventies by the USAian government [46], along with worldwide free
access to the Standard Positioning Service (SPS), provided precise
location information. The capability to use the GPS location is built
into the current generation of mobile handsets. The many applications
available for the mobile phone made it possible to introduce services,
many of which require SPS. The use of GPS allows the various
applications running on cell phones to keep abreast of the location of
the user. Some of these, useful to the user for applications such as
directions and maps, also allow the marketing of nearby businesses to
the cell phone user.

Furthermore, the applications on the mobile system allow
their developers to know precisely the whereabouts of the mobile
device and hence its owner. These locations are recorded by the
application developer and the supplier of the mobile operating
system. The web and cell technology, the access to users' data in
their communications and through tracking of their use of the web, and
the application of advances in computer science, including data
management and machine learning algorithms, together created
what Zuboff calls Surveillance Capitalism [99]. The exploitation
of personal data by private corporations is finally drawing the
attention of scholars and columnists, and it is finally reaching the
masses. The addiction to the cell phone means that one is constantly
looking at it even when out in company with friends and family
for a meal or just walking around.

The recognition of corporations as legal entities made it possible
to put in power politicians who reversed the progressive
nature of the taxation in the wrong belief that there would be a
trickle-down effect from the haves to the have-nots. Unfortunately,
this effect has not occurred and the growth of the middle class has
been halted and reversed. Competition in the form of another company
providing a similar, privacy-oriented service seems hardly possible.
Because of the large market share of the incumbent, another similar
OSN, even one such as Google+, did not succeed.


1.1 Exposure to Privacy in the Computer 
Science Curriculum 


Many of the current tech giants are headed by people who may have
followed a computer science program and/or are "tech-geeks", some of
them being drop-outs. One can safely assume that the majority of
the coders could have had a computer science related education.
However, their exposure to humanities and social sciences would
have been very limited, if not nil, as it is in many CS programs. The
curriculum recommendation from ACM/IEEE includes the following:
"A computer engineering curriculum must include preparation
for professional practice as an integral component. These practices
encompass a wide range of activities including management, ethics
and values, written and oral communication, working as part of
a team, and remaining current in a rapidly changing discipline."
However, not much is said about issues of privacy and security,
except as a matter of ethics. This matters given the
recklessness shown by the robber barons of the late 20th and early
21st centuries, who go ahead like bulls in a china shop and seem to
have no regard for a person's privacy.

The sample curriculum for computer science, which runs into
hundreds of pages [3], includes exposure of the student to Social
Issues and Professional Practice, and the document points out that
"Graduates should recognize the social, legal, ethical, and cultural
issues inherent in the discipline of computing. They must further
recognize that social, legal, and ethical standards vary
internationally. They should be knowledgeable about the interplay of
ethical issues, technical problems, and aesthetic values that play an
important part in the development of computing systems. Practitioners
must understand their individual and collective responsibility and
the possible consequences of failure. They must understand their
own limitations as well as the limitations of their tools." The
section SP/Privacy and Civil Liberties is allotted 2 Core-Tier1 hours,
in which philosophical and legal issues, privacy tools and their
implications, and related issues are presented. This hardly seems
adequate and is likely to run off like water off the proverbial
duck's back. Since some aspects of this responsibility are not
encouraged to be practiced in the rest of the program, the final
impact is almost nil. What is puzzling is that in Appendix C of this
document, where course exemplars are given, there is not a single one
just for SP/Privacy and Civil Liberties. One, given on page 304 on
Social Issues and Professional Practice, is part of a course which
includes Human Computer Interaction and Graphics and Visualization.
Another example is Ethics & the Information Age [3] (p436), which
however does not touch on the philosophical issues of property,
personhood and the right of a person to privacy. In Stanford
University's CS program, the course CS181 - Computers, Ethics, and
Public Policy allocates a scant 1.6 hours to Privacy & Civil
Liberties [3] (p501).

One example of a privacy framework is the privacy and security
framework [19] of the Canadian Institute for Health Information
(CIHI), an independent, not-for-profit organization that provides
essential information on Canada's health systems and the health of
Canadians. Most engineers and software designers are not very well
exposed to privacy and may have been exposed minimally to security.
However, they and the marketing people would likely ignore most of
the issues in such frameworks.

The privacy and security page of the USAian Federal Trade
Commission (FTC) has the following about data security: “Many
companies keep sensitive personal information about customers
or employees in their files or on their network. Having a sound
security plan in place to collect only what you need, keep it safe,
and dispose of it securely can help you meet your legal obligations
to protect that sensitive data. The FTC has free resources for
businesses of any size” [39]. The guidelines are only for
self-regulation and the penalty is fairly small; as reported recently,
about 22 million USD [35]. The issue addressed by the FTC is based on
the agreement it had with Facebook on privacy; the FTC claims that the
company “deceived consumers by telling them they could keep their
information on Facebook private, and then repeatedly allowed it to
be shared and made public” [35]. Compare this paltry sum with the
one in the guidelines for the EU, which call for maximum penalties
of 20 million Euro or up to 4% of the worldwide revenue for a
single breach, which can add up to billions of Euros [7]. In the USAian
system, privacy issues are being handled by a trade commission,
not a human rights agency.
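
To make the contrast concrete, the EU ceiling can be worked through as a small calculation. This is a minimal sketch, assuming the rule in GDPR Article 83(5) that the cap is the higher of 20 million Euro and 4% of the worldwide annual revenue; the function name gdpr_max_fine and the revenue figures are hypothetical and chosen only for illustration.

```python
def gdpr_max_fine(annual_revenue_eur):
    """Upper bound of a GDPR fine for a single breach, in whole Euros.

    Assumes the Article 83(5) rule: the greater of a flat 20 million Euro
    and 4% of worldwide annual revenue. Integer arithmetic avoids rounding.
    """
    flat_cap = 20_000_000
    revenue_cap = annual_revenue_eur * 4 // 100
    return max(flat_cap, revenue_cap)


# Hypothetical companies (illustrative revenue figures only).
small_company = gdpr_max_fine(100_000_000)      # 4% would be 4 million
tech_giant = gdpr_max_fine(50_000_000_000)      # 4% is 2 billion
```

For the hypothetical tech giant the ceiling is 2 billion Euro, roughly two orders of magnitude above the 22 million USD penalty cited above.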

Even though there is so much concern about security, there have
been some large breaches in recent years. Many systems store
sensitive information such as passwords in clear text. The fact that 
the tech giants share information with third parties is enough for 
one to opt out of any system that needs third parties to carry on 
their central tasks. 


2 SOURCE OF PRIVACY VIOLATIONS 


The biggest sources of privacy violation are invisible. On-line
shopping requires passing valuable personal information to big as well
as small retailers. Some of them are fly-by-night operations while
others are multi-billion dollar enterprises. Many small fry ride on the
coattails of specialized shopping portals. Many of these retailers,
to increase their revenues, turn around and sell the personal
information to data aggregators. The portals could also have access to
such data and can use it to direct publicity for products and services
to the users; with the use of cookies and trackers, all this data goes
into many different data repositories to be exploited, ad infinitum.

While shopping or doing any operation on line, one is tracked
by a myriad of trackers. A case in point is a session with one’s own
bank. If one has a tracker-reporting add-on in the browser, e.g.,
Privacy Badger, one sees the trackers used by these banks to track
their own customers and share the data with third parties!
A question sent to the bank about why this is being done is never
responded to!


2.1 A Typical privacy agreement 


When a person signs up as a user of most of the ‘free’ on-line
services, or of services such as a mobile phone supplier, she accepts,
unread and much less with a clear understanding of what it implies,
their privacy policy, which is linked to an equally unread and not
understood data policy. These policies may be updated without the
users’ consent. As an example, consider the privacy policy
[34] and the data policy [33] of Facebook, which run to, in the
version currently accessible, 7 and 9 pages respectively; it is no
wonder that no one reads these, and that one assumes that these privacy
policies mean that the site will keep her personal information private
and would not share it without her permission [84]. Little does the
unsuspecting person know that she is giving away a free license
to persons and organizations who believe that privacy is no longer a
social norm [53].

The user is required to let the supplier of the service reserve
the right to process, sell, trade or rent aggregated information or
the user's information which is anonymized. It is well known by now
that most anonymizing schemes can be thwarted by combining
information from multiple sources. The information that is up for
grabs includes¹:

Personal information: including name, mailing(postal) address, 
email address, telephone number, IDs of accounts, device identifiers, 
PIN, service provider information, account including credit card 
credentials, passwords, records of all communication as well as 
details of contacts. 

Applications: All providers of applications have access to not 
only their own application data but also may share this data with 
other applications on the device. This looks like a modus operandi 
of all application developers and as Zuboff says, anything that is 
not guarded would be claimed by these new pirates. 

Back-up data on cloud: The provider would have access to users'
personal information including contacts, email addresses, calendar,
memos, tasks, display pictures, status messages, photos, audio and
videos; the stated reason is to be able to restore this information.

Cookies: These were introduced in the web space to overcome
the stateless nature of the web protocol. The stateless nature of the
web was due to the philosophy of free sharing of knowledge.
However, cookies and their derivatives have morphed into a
nefarious form to facilitate surveillance.

Financial Information: Any transaction through the system
may require credit status checking, etc., any or all of which could be
recorded and shared with other parties.

Third party information: The service provider may combine
your information with information obtained from other sources.

Retention of personal information: Even after the expiry of
any direct association with the service provided, the information
could be retained, perhaps in an anonymized form, and may be used
perpetually.

International operations and onward transfers: The service
provider would require you to consent that your personal
information may be collected, used, processed, transferred or stored in
multiple jurisdictions.

Communication: The service provider may communicate in- 
formation, surveys, marketing materials, advertisements or per- 
sonalized content. The service provider may share your personal 
information within the service provider and with their service 
providers, financial, insurance, legal, accounting or other advisors. 
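
The remark above that most anonymizing schemes can be thwarted by combining information from multiple sources can be illustrated with a toy linkage example. This is only a sketch: the function name link, the choice of postcode, birth year and sex as linking attributes, and all records are hypothetical.

```python
def link(anonymized, public, keys):
    """Re-identify anonymized records by joining on shared attributes.

    A record is re-identified only when exactly one named person in the
    public source has the same values for all attributes in keys.
    """
    index = {}
    for person in public:
        signature = tuple(person[k] for k in keys)
        index.setdefault(signature, []).append(person["name"])

    matches = {}
    for position, record in enumerate(anonymized):
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:
            matches[position] = names[0]
    return matches


# Hypothetical "anonymized" purchase log: names removed, attributes kept.
anonymized = [
    {"postcode": "H3G", "birth_year": 1980, "sex": "F", "purchase": "book"},
    {"postcode": "H4B", "birth_year": 1975, "sex": "M", "purchase": "phone"},
]

# Hypothetical second source that still carries names.
public = [
    {"name": "Alice", "postcode": "H3G", "birth_year": 1980, "sex": "F"},
    {"name": "Bob", "postcode": "H4B", "birth_year": 1975, "sex": "M"},
    {"name": "Carl", "postcode": "H4B", "birth_year": 1975, "sex": "M"},
]

matches = link(anonymized, public, keys=("postcode", "birth_year", "sex"))
```

Here the first record is tied back to a single named person, while the second remains ambiguous because two people share the same attributes; each additional source or attribute shrinks such ambiguity.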

Here are some of the things these systems have your permission
to lay claim to! Any information and content you provide or they
collect from creating or sharing content, contents of messages or
communications with others, and all information provided while
using any of their products including information of the account.
They collect details about your connections, address books, logs,
meta-data and contents of all communications including all SMSs
and emails; and patterns of usage including what, when, where, who
(and use their algorithms to try to figure out why!). They also lay
claim to all transactions made, including purchases, with the details
of the credit/debit cards used, authentication information, addresses
and contact information about the transaction. In addition, they
have access to actions taken by your contacts and the information
they provide. Your location information is used to determine where
you live, where you go, what events you attend and where you are
at any point in time. All this information is used to create targeted
publicity which is tailored to influence you, using your foibles
determined by their unknown algorithms.


¹ The following is based on the privacy/data agreements of a number of
organizations including Apple, Blackberry, Google, Facebook, etc.

The proliferation of the internet, via the medium of the web, to
offer all types of services requires a user to sign up using a user
name and a password. Since more and more services, such as news,
financial, governmental, social and commercial ones, are now offered
through the web, a typical user may have scores of user IDs and
passwords. The tech giants, to increase their presence, have offered
to enter into agreements with many of these services to let
the users employ the tech giants' credentials to log into these
services. Thus the tech giants can trace the user not only on their
own platform but can also have access to what other services are being
used and whatever other information the target service may provide
the tech giant. What information these giants glean, and how, besides
associating yet another data point with the profiles of
these users, is not advertised or communicated to the user.


3 PRIVACY VIOLATION AT ANY LEVEL OF 
SHARING 


One of the culprits in the current loss of privacy is the USAian
system, its constitution and the outlook of its capitalistic system.
Whereas there are some forms of restraint in its constitution and
amendments on the USAian government's collecting and using personal
information, the private sector is left alone to do as it pleases
with a laissez-faire, self-policing attitude. The spying that citizens
do not trust the government with is allowed to the private
corporations. That self-regulation does not work is amply illustrated
by the recent Boeing 737 Max’s design flaws, which led to two deadly
crashes. An optional display that showed the disagreement of the
angle of attack sensors on the Boeing Max required an additional cost
on top of the millions for the plane.

Furthermore, the fact that Boeing was able to get away with not
having the Federal Aviation Administration (FAA) really act as an
independent quality control shows that self-regulation is unreliable.
According to [31], [28]: “The problems were apparently compounded by
FAA rules allowing manufacturers to essentially self-certify aircraft.
Boeing reportedly tried to speed up the process in order to catch its
rival Airbus A320neo, and pushed the FAA to give it more
responsibility. ‘There wasn’t a complete and proper review of the
documents,’ a former Boeing engineer said. ‘[The] review was rushed to
reach certain certification dates.’” [31]. The failure to provide the
correct software and the required equipment for a high-priced airframe
leads one to wonder about the type of security employed by many of
these tech giants, who have no regulation, no oversight and no
competition and pay little taxes. They fail to reveal a breach of
security, or the lack of security, for months and years. According to
the press, Google did not reveal a security breach for fear of
regulations [29].

With current internet and wireless technology, people actually
pay to use the free services: they pay monthly internet connection
charges, and buy and pay the connection fees for ‘smart’ devices
that allow them to be tracked. Unlike criminals, who are
tracked by a tracking device imposed on them, most consumers now
carry a tracking device and pay for it handsomely, every month,
including for the bandwidth used for tracking.

The result of the USAian system, where the tech giants are based,
is that the private sector has laid claim to personal and private
information of the users of the myriad of devices that they own. Most
of the smart devices are controlled by just two operating platforms,
again controlled by USAian tech giants. In addition, they control
the application stores from which users download the ‘apps’ and
earn a percentage of the fees for these applications. One wonders if
this is not an example of a monopoly! An example of such staking of a
claim, like those of the gold rush of yore, is to claim all human
experience as free raw material without any concern for individual
rights and without any payment of any sort [99]. Zuboff
compares these claims to the edict recited by the Spanish conquistadors
and later the settlers of the west in what is now known as the U.S.A.
This edict gave the conquistadors and the settlers some form of
divine right which allowed them to usurp the lands of the existing
people and displace them or wipe them out [99].

Google made six cooked-up declarations which confer on itself
the right to translate the recorded experience of its users into
behavioral data and own it, abuse, use and share it as it sees fit
and preserve it in perpetuity. They had no problem getting all
this data since they had captured the search, the email and the cell
phone markets. They were also in control of the application
marketplace for their cellphones. Another instance of conquest by
declaration is the self-proclaimed one by the Facebook founder, which
stated that privacy was no longer a social norm. This statement from
a person with very little background in privacy was convenient
since it was the basis of Facebook’s business model [53], and this
declaration, along with a changeable data/privacy policy, has been used
to mine the information entrusted to them by unsuspecting users.
Facebook’s usage of this data has been seen to violate the users’
privacy in many ways. This includes influencing them not only to
buy products and services of questionable need but also exposing
them to fake and biased news and helping create targeted persuasive
ads to influence a vote for doubtful candidates and proposals. It is
no wonder that, over the years, Facebook has faced increasing scrutiny,
borne out by the number of times it has been cited by the privacy
commissions, the courts and the popular press [30]. Facebook
allowed phone companies [58] and other tech giants to access
user data [27]; it stretches and oversteps privacy and competition
laws and should be regulated urgently [58]. Others have taken [23], or
want to take [45], Facebook to court.

According to the summary of the final report[72] of UK’s Dig- 
ital, Culture, Media and Sport Committee: “among the countless 
innocuous postings of celebrations and holiday snaps, some mali- 
cious forces use Facebook to threaten and harass others, to publish 
revenge porn, to disseminate hate speech and propaganda of all 
kinds, and to influence elections and democratic processes—much 
of which Facebook, and other social media companies, are either
unable or unwilling to prevent. ... The big tech companies must not
be allowed to expand exponentially, without constraint or proper 
regulatory oversight. But only governments and the law are power- 
ful enough to contain them. The legislative tools already exist. They 
must now be applied to digital activity, using tools such as privacy 
laws, data protection legislation, antitrust and competition law. If 
companies become monopolies they can be broken up, in whatever 
sector. Facebook’s handling of personal data, and its use for polit- 
ical campaigns, are prime and legitimate areas for inspection by 
regulators, and it should not be able to evade all editorial 
responsibility for the content shared by its users across its 
platforms.”[89] Even the people who were involved in the early days 
of Facebook, and its mentor, seem to agree with the findings of this 
and other reports[66], [48]. After having collected millions of email 
addresses, Facebook says it will stop this practice and notify 
users[47]. 

Facebook has used parental influence to mold EU laws[40] and has 
put pressure on politicians around the world, promising local 
investment such as installing data centers in exchange for lobbying 
on the company’s behalf to block privacy laws and to ensure that any 
forthcoming laws are Facebook friendly[16], [90]. The fact that the 
earnings the companies make by their presence in a country are not 
taxed is something that the tech giants have been successful in 
protecting, and they continue to lobby for it[90]. Facebook allows 
governments to target individuals and groups to the extreme, e.g., 
the Rohingya genocide[51], [55]. Facebook’s new turn to privacy seems 
to be fake, meant to decrease its civil liabilities, and is in fact 
yet another business spin to try to protect its dominant position 
and keep regulation and corporate breakup at bay[13], [88]. Some 
demands for investigating the lobbying of tech giants are ignored 
by those in power who hope to benefit from their largesse at 
election time[71]. 


3.1 Examples of privacy violations 


Over the years, there have been many instances of violation of the 
common notion of privacy. Even the blanket surrender of privacy 
in the privacy agreements of the tech giants is often not honoured, 
much less the notion of privacy formed over the last few centuries. 
An overview is reported in [93]: Google’s Street View violates 
privacy by taking videos of private homes and spaces, along with the 
people therein, and publishing them without any authority. When met 
with resistance, the company held off and returned when no one was 
looking. 

Facebook Beacon published purchases made by users without 
their express consent. Facebook uploaded the email contacts of 1.5m 
users without consent and, when discovered, said it was inadvertent; 
actually it used a feature of a previous version. As usual, the 
information mined from the user contacts and propagated into other 
databases may not be deleted but used: more of deny, deflect, etc. 

Google says a microphone in one of their products, which was 
not revealed to the buyers, was never activated; one has to take 
this with a grain of salt when the courts have to tell them to take 
down, world-wide, search results for web sales of products 
manufactured in violation of trade secrets[18]. There have been 
many instances of tech companies being warned about privacy. One 
such is the report by Denham, the Assistant Privacy Commissioner 
of Canada[26]. At that early date the report concludes that Facebook 
did not have “safeguards in place to prevent unauthorized 
access by application developers to users’ personal information, 
and furthermore was not doing enough to ensure that meaningful 
consent was obtained from individuals for the disclosure of their 
personal information to application developers”[26]. 

There is a class action suit against Facebook that has been going 
on for years in British Columbia, and the company has used all 
its resources to keep it from being resolved. The case concerns 
the practice, used by Facebook as of 2011, of featuring users’ 
‘likes’ in publicity without the users’ explicit consent. The class 
action was filed in May 2014[23]. The company denied wrongdoing, 
saying that the consent was automatic, and fought it all the way to 
the Supreme Court of Canada; after close to four years the case was 
won by the plaintiff and the class action was returned to the BC 
courts. It may take a few more years before the class action suit is 
decided, and of course there would be appeals and likely trips back 
to the Supreme Court. In the meantime most people would give up, 
and this is what companies with deep pockets, able to hire the best 
lawyers, count on. For not obtaining explicit consent from users to 
use their data, Facebook is facing a fine of up to 5 billion USD 
from the USAian Federal Trade Commission. 

Companies claim that they protect your data; however, it seems 
that in fact they exploit it and are hacked, as reported in the 
popular press time and again. Data breaches at companies have been 
affecting more and more people since the early days when Apple 
stored passwords in the clear and had to grudgingly own up to 
it[93]. 


3.2 Children’s Privacy 


The Children’s Online Privacy Protection Act (COPPA)[38], a two 
decades old USAian federal act, protects children’s privacy by giving 
parents tools to control what information is collected from their 
children online. The personal information covered consists of: a 
first and last name; a home or other physical address including 
street name and name of a city or town; an e-mail address; a 
telephone number; a Social Security number; any other identifier 
that permits the physical or online contacting of a specific 
individual; or information concerning the child or the parents of 
that child that the website collects online from the child and 
combines with other identifiers. A number of tech giants have been 
fined for COPPA violations. TikTok, an OSN video-sharing 
application, is alleged not to seek parental consent before 
collecting information from children under 13 years old[87]. It has 
been banned by the governments in India and Bangladesh and has been 
fined in the USA[86]. 

Other on-line tech giants let children run up credit card charges 
through in-application purchases while playing games on devices such 
as the iPad and iPhone. This kind of preying on children has been 
going on for a long time, as illustrated in a story involving 
Farmville, a Facebook game, reported in 2010[52]. 


3.3 Legal Actions 


The Privacy Commissioner of Canada launched an investigation 
in 2018 to examine whether Facebook’s practices are in compliance 
with Canada’s federal private sector privacy law, the Personal 
Information Protection and Electronic Documents Act (PIPEDA)[69]. 
However, this was not the first time: there were early warnings in 
2009 about the privacy issues with OSNs such as Facebook[63], [83]. 
Many of the complaints found Facebook to be in contravention of 
the Act and Facebook was to take corrective measures. However, as 
in the class action launched by Deborah Douez, the case has been 
going on over many years and is yet another example of the deny, 
deflect and defend mentality of these tech giants[23], [24], [25]. 
In a more recent report of the joint investigation of Facebook by the 
Privacy Commissioner of Canada and the Information and Privacy 
Commissioner for British Columbia, the conclusions drawn are that 
Facebook failed to obtain valid and meaningful consent from users 
or their friends. Furthermore, the company did not have adequate 
safeguards to protect users’ information and was not accountable for 
the information under its control[73]. The selective restriction used 
by Google, for example in Google v Equustek Inc, was found to be 
insufficient and the request for a world-wide ban was upheld. 

The availability of free, widely used OSN platforms allows anyone 
to post anything on them. The posters range from ignorant and 
zealous bigots, paid geeks and agents of governments to misinformed 
tweeters. After many denials and deflections, some of these OSNs 
are finally admitting that their platform is a vehicle for fake news 
etc.[43] and making a feeble attempt to do something. The attempt 
is lack-luster: a mere 40 people to fight millions of potential 
sources of fake news. The company is making sure to get as much 
spin out of it as possible by inviting dozens of journalists into the 
‘war’ room set up to fight this fake news; it claims that these 
crews are backed by other unnamed and unseen experts and, of 
course, the unknown, unproven algorithms! 

While these tech giants claim not to be evil and to want people to 
connect, they are in fact exploiting the recorded human experience 
to enrich themselves. By using the leverage of different kinds of 
equity (more than one vote for some types of shares, no vote for 
others, and/or not allowing some of the voting shareholders to 
vote against members of the board), they retain the majority voting 
rights and make sure that the reins of these tech giants are preserved 
in a dynastic fashion. The securities system of their host country 
(USA) allows this type of capitalism. The USAians, who seem not to 
question such practices, which encourage growth without much 
social good, are responsible for the existence of this dystopian 
state of affairs, which continues to degrade human existence not 
only in their country but in most other countries. The exceptions are 
those countries that have put in safeguards and nurtured their own 
tech giants. 

The only way to challenge what these tech giants have usurped and 
now own is to take legal action which, except for a few[23], is 
beyond the economic means, personal energy, commitment and moral 
resources of the rest of us. 


4 WAITING FOR A SOLUTION 


The protection of privacy, a human right under threat from the tech 
giants and the goblins that they create, requires some action. This 
could be political and legislative, in the form of regulations and 
legislation with sizable penalties proportional to the income of the 
culprit, taxation of that income, etc. Another approach is to set up 
a national service for what has now become the channel for much of 
our communication. The third approach, presented in the next 
section, is a technical solution to render the tech giants 
obsolete! 

One opinion expressed in the press on handling the tech giants 
is to recognize the service they provide as a public service and to 
provide a national service under the control of an independent 
neutral organization and/or socialize them[82]. They have 
monopolized a number of services that they have usurped or 
re-engineered and made the population addicted to them. The 
addiction is evident in the homes, offices, public places and social 
get-togethers where everyone constantly glances at their hand-held 
devices[68]. 
These addicts are waiting for the next shot! No one seems to have 
recognized this addiction. 

Waiting for a political solution is like En attendant Godot[14]: 
Godot never comes. The bent politicians are not in a hurry, nor do 
they seem to have the moral strength, to break up these tech giants. 
The addiction that has been created with the so-called free 
services has kept the politicians at bay. No thought has been given in 
any government to setting up a national email service as an essential 
public infrastructure, much like health, postal, road, school or train 
services. Even the telecom service is regulated in most countries. 
Since the internet depends on the telecom service, it should be 
regulated, with the tech giants at least held responsible for the 
contents. They should be taxed on their earnings in the jurisdiction 
where they are earned; there should be a penalty for the jobs that 
are shipped outside the country and for importing and exporting data. 
The tax should be at a progressive rate such that the majority of the 
excess profit is taxed. This may encourage the tech giants to set up 
jurisdictional data farms to serve local emails and social contents. 

Douthat[30] compares the western internet, dominated by the 
USAian tech giants, and the Chinese one, dominated by the central 
government. The result in the West is the addiction generated 
by the internet and its control by a few corporations which 
at times work with the government; mistakes made on it are 
magnified. Lies and fake news are spread by it and real news is, by 
repetition from the top, labeled as fake news[30]. 

Cryptocurrency evolved much later than search engines. Its 
spread is liable to upset the financial sector and the basis for the 
support of the political system everywhere, much as the so-called 
open internet has done by concentrating imperialistic power 
in the hands of a few tech giants, all under the USAian form of 
capitalistic protection. However, the move to regulate 
cryptocurrency has already begun among legislators in various parts 
of the world. Regulation requiring the tech giants to respect the 
privacy of their users and not exploit their personal information to 
manipulate them is still missing. 


4.1 A possible start 


One of the principles of privacy in the European Union’s General 
Data Protection Regulation (GDPR) is that a person is the owner 
of her data and has the right to decide who can use it and 
how. Regardless of where and how the data is shared, she can have it 
amended or deleted, and she can determine who accesses it and 
how[32]. GDPR went into effect in the EU in May 2018[42]. Its 
objective is to give individuals control over their personal data, 
and it requires any organization that collects and controls personal 
information to have in place appropriate measures, both technical 
and organizational, to implement the data protection principles. 

Such organizations are required to disclose the legal basis and 
purpose of their data collection operations, and to publicize the 
period of data retention and any sharing of the data with third 
parties. The data collecting organization is required to provide, to 
any data subject on request, a portable copy of the data collected in 
a common format. The data subject has the right to have their data 
corrected or even deleted. There are penalties for violation of this 
regulation; under it, France recently fined Google 57 million 
USD[75]. 

Within a few months of GDPR coming into force, the USAian 
government is finally waking up to some form of legislation 
for consumer privacy[59], driven, ironically, not by concern 
for consumer privacy but as another component of high tech 
competition, as outlined by Apple’s CEO[21], [50]. In the meantime, 
activists are filing an increasing number of complaints under the 
GDPR[6]. In spite of the protection it affords, GDPR still allows 
the legitimate business interest of the data processor not to be 
overridden by the fundamental rights of the data subject! 

GDPR applies only to the EU but, given the scale of the market, 
many companies are deciding it’s easier, not to mention a public 
relations win, to apply its terms globally. The problem is that 
even if there is a directive, even from a court, tech giants seem 
to consider themselves immune to it. A very recent example 
concerns a ban put in place by a New Zealand court on naming an 
accused killer. The local media companies, against whom the court 
could take action, use resources to make sure such court bans are 
respected not only by themselves but also on their own social media 
channels. Google does not apply bans globally; in line with its 
policy of geo-blocking (being bound by local blackouts only in the 
jurisdiction concerned, not globally), it emailed the name out to 
users, apparently not in New Zealand, who had signed up for 
“what’s trending in New Zealand”[57]. 

The effect of GDPR is being felt on this side of the Atlantic: 
for example, a proper notice about cookies and the use of 
analytics has to be given to EU citizens when they access USAian 
web sites. As usual there is an agree or not agree option! As usual 
it is too tempting to agree instead of looking at the privacy policy, 
third party partners or terms of service, which are many pages long 
as pointed out earlier. 

Competition from another tech start-up providing a similar 
service seems hardly possible. Because of the large market share of 
the incumbent, another similar OSN, even one such as Google+, did 
not succeed. Another avenue being used in the EU is to protect 
competition by blocking the tech giants from buying start-ups that 
may become serious challengers some day. Such acquisitions have been 
allowed to proceed in the USA to date: the purchases of WhatsApp and 
Instagram by Facebook are examples. The European model, where the 
dominant giants are forced to share the data[32], goes back to the 
conclusion of Workshop A held in April 1995, which recommended that 
search engines share information[9]. 


5 WAY OUT OF THE PRIVACY AND SECURITY TRAP 


In the current situation a small number of USAian tech giants 
are controlling the web and all human knowledge and experience; 
they are manipulating awareness and beliefs to serve their aim of 
continued domination and maximal profits. Their huge profits 
allow them to buy out any potential competition, and they are moving 
in new directions every day. This, after all, has important elements 
in common with imperialism and totalitarianism. No surprise then that 
a country which experienced the latter most dramatically, Germany, 
has some of the strongest laws to safeguard privacy. Even so, these 
efforts are merely corrective: they police the problem rather than 
solve it. To actually overcome this system, a new solution is needed. 

It is unlikely that many politicians who are heading governments 
or are part of the government have much motivation to do anything 
about privacy. The existing laws have no teeth and the tech giants 
are happy to put up the three big Ds (deny, deflect, delay). Each 
year that they can delay action, they become more established, make 
a few more billions and are able to finance more elections and place 
their men (mostly) in the driver’s seat. 

There are many political ideas put forward by various aspiring 
politicians in the western world, notably in the prelude to the 
USAian 2020 presidential election. They include breaking up the 
tech giants, giving users more control of their data, making 
the algorithms transparent etc. None of this may work. Take, for 
instance, making the algorithms more transparent: most users, who 
don’t even read the privacy agreement, would not be able to 
understand the working of the algorithms. It is also doubtful that 
the tech giants would ever be willing to make their algorithms 
transparent. 

The other idea is to increase competition; however, this is also 
a non-starter. The tech giants have big market capitalization and 
have politicians in their pockets. They make every possible effort to 
influence politicians, and they have direct lines to the ministers 
and presidents. As a result we are proposing here a method to turn 
the clock back and bring home all communications and the data 
that is shared. 


5.1 Lifting the cloud 


Most users of the ‘free’ services would not have read the privacy or 
data-use policy when they signed up for these services. Reading 
these policies, which are many pages long, would be confusing with 
all their exceptions, and fighting any of their effects leads to years 
of battle in the courts, as is evidenced by the case cited earlier; 
such drawn out cases would exhaust the emotional energy of most users. 

The Web is a relatively recent way of doing things and, as in many 
facets of human existence, the way to do things swings from one way 
to another like a pendulum. Computing is no exception. We started 
with the idea of a ‘one-off computing system’ which would have been 
used to produce useful mathematical tables which would be printed 
and shared. In reality, this is not what happened. Computers were 
developed as a proof of concept and from there went on to become 
what were called ‘main frames’: expensive and bulky systems. They 
were time-shared by many users locally or remotely using dedicated 
telecommunication lines. 

In the nineteen-sixties there were two trains of development. A 
family of main frames was affordable enough for many 
organizations to have their own computer systems and software 
development teams. At the same time mini-computers were developed 
to be used by smaller organizations and labs. The mini-computers 
evolved into the micro-computers and the personal computer (PC) in 
the late 1970s, and many people were able to have a personal desktop 
to do their own processing. The personal data was housed on the hard 
disk of the PC. Development of hard disk technology allowed 
increases in speed and capacity. It was possible to store all 
personal information locally. 

The development and spread of the internet in the 1980s, and 
then of the world wide web starting in the early 1990s along with the 
graphical browser, allowed the non-tech-savvy person to be 
connected. The misdirection of the web by mainly commercial interests 
and the opportunity to claim uncharted territories prompted many 
tech buccaneers and geeks of the “dot.com” craze to start violating 
unwritten traditions and to introduce and use surveillance tools; 
they were thereby able to amass huge troves of information. 

The failure of the postal services to see electronic mail as a new 
public postal service, together with the ignorance and self-interest 
of politicians, left the new domain unregulated and allowed private 
ownership of the personal information of hundreds of millions of 
individuals. The first incursions of private venture capitalists were 
in the domain of web search and email, and the early companies 
included Altavista, Yahoo, Excite and Lycos. Even though web search 
engines started appearing in 1993, it was a later entry which 
captured the search market. Even though most search engines produce 
similar results, habits and the default settings in browsers tend to 
prioritize one. 

As pointed out in [99], the concentration of data by such 
organizations is making it difficult for competition to be effective. 
The EU has ruled against Google many times in recent years; all of 
these rulings are fought in the courts and the monopoly continues. 
The habit of people to flock to a system where others are, and hence 
to believe it to be a better system, has worked even against titans: 
Google was forced to shut down Google+, its social network. Not 
waiting for the breakup of these tech giants, and believing that less 
is better, we propose here that users take back control of their 
data, lives and privacy by hosting their own email and web server 
and setting up their own social network. 

In a previous work we pointed out the privacy issues with 
the increasing number of IoTs which transmit personal information 
to the servers of the makers of the IoTs. The key there was 
Heimdaller and the setting up of a Software Assurance Agency 
(SAA)[1]. This agency is an independent one, hence not run by a tech 
giant, and requires that any device manufacturer submit all software 
and updates to it for verification. No software would be accepted 
unless it is certified for suitability. Unlike the ‘stores’ run by 
tech giants, SAA does not get a percentage of the revenue for the 
software; however, it charges a fee based on the size of the 
corporation. It is felt that there is a need for an independent 
organization such as SAA for the software industry, much like the 
certification authorities CSA and UL. Here we propose to 
extend Heimdaller to not only monitor the IoTs but also act as a 
server for a personal email system and the web. 

There are many systems that allow users to create their own web 
pages: an example is Facebook! Considering the number of articles 
and litigations it has generated, it is time that, instead of giving 
away all this information to a corporation and sharing it with 
strangers, a personal web server be used to make the personal web 
page accessible only to the immediate family and friends. 

A small single-board computer such as the Raspberry Pi is very 
affordable and is suitable for driving a personal email and web 
server with very little load and bandwidth need; solid state memory 
and drives are now also very affordable and could provide sufficient 
secure storage for the family server. The system would have its 
own storage and backup system; hence all the family data, emails, 
web pages, comments etc. would be stored locally and there would be 
no need to use a cloud, thus depriving tech giants of the free raw 
material (data) and of an opportunity to mine this information for 
their own profit. The proposed system, hence, includes processing 
and storage. With a cheap processor such as a Raspberry Pi 2 and an 
SSD, the modem required in private homes to connect to the internet 
through the intermediary of an ISP takes on the function not only of 
a sentry but also of a data vault. 

Arendt[5], in her chapter on Imperialism, talks about Cecil 
Rhodes, quoting the words of Millen: “expansion is everything”[67]. 
Rhodes, looking at the stars and planets, fell into despair since he 
wanted to annex all the planets for the British empire that he adored. 
Much like Rhodes, some of the tech giants consider growth to be 
a good thing regardless of the collateral harm it does[66]. For 
instance, Facebook knew that its platform could expose someone 
to bullying and be used to coordinate terrorist attacks. A blunt memo 
by Bosworth recognizes that and notes that “The best products don’t 
win. The ones everyone use win.”[66] With products like his, more 
users are attracted and they invite yet more! 


5.2 Proposal for a Technical Solution 


Breaking up the tech giants is not going to happen soon, nor would 
an alternative commercial service start out with the monetary power 
of the existing tech giants. They have the resources and staying 
power to bankrupt, buy and squash competition[98]. They have hundreds 
of lawyers working for them and connections to the highest levels of 
government. The proposed system includes a modest processor, 
an email and web server, a lightweight database and a new 
generation of modem router. The system addresses the biggest 
source of privacy violation: email and web presence. 

Our proposal is simple: add the functionality of an email server 
and a web server to the modem-wireless router that most people 
now have in their home or small office. This requires adding 
simple interfaces to allow even the least tech-savvy user to 
manage these servers. All emails will originate in the user’s own 
system; the personal web server would host the person’s web pages 
and all the contents would be stored locally. Access to the web 
server would be limited and the data could be shared as appropriate 
with various levels of security. Only invited persons would be able 
to access any contents and, since the user is in control of the web 
server and all its contents, she has full control. Heimdaller is the 
gatekeeper, and all interaction between the Internet and the IoTs, 
including that coming from the users and the IoT makers, goes through 
the gatekeeper. All software updates have to be submitted to the SAA, 
which verifies them; if an update passes the tests of functionality, 
it is certified and made accessible to Heimdaller. Only SAA-verified 
software could be installed in the system. 

Figure 1: Proposal for a Technical Solution 
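
The rule that only SAA-verified software may be installed can be 
illustrated with a minimal sketch. It assumes, purely for 
illustration, that the SAA publishes a SHA-256 digest for each 
certified software image and that the gatekeeper refuses anything 
whose digest is not on that list; the class name, the digest list and 
the sample image are hypothetical and not part of the proposal. A 
deployed system would more likely rely on digital signatures. 

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a software image as a hex string."""
    return hashlib.sha256(data).hexdigest()


class Gatekeeper:
    """Toy model of the install check (hypothetical, for illustration)."""

    def __init__(self, certified_digests):
        # Digests assumed to have been published by the SAA.
        self.certified = set(certified_digests)

    def may_install(self, image: bytes) -> bool:
        # An image is accepted only if its digest is on the certified list.
        return sha256_hex(image) in self.certified


# Illustrative data: one certified update and one altered copy.
certified_update = b"thermostat-firmware v1.2"
gate = Gatekeeper({sha256_hex(certified_update)})

print(gate.may_install(certified_update))            # certified image
print(gate.may_install(certified_update + b" mod"))  # altered image
```

Any change to a certified image changes its digest, so an altered 
update is rejected without the gatekeeper needing to understand what 
the software does. 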

There would no longer be a need for any tech giant to provide 
email or web service. Technology has progressed to such an extent 
that these services could be incorporated in a device many home 
owners already have, and its cost would be no more than that of 
the latest mobile device. The system would provide each family, in 
their own home, with an email server and a web server with their own 
family pages where they can post their news and share it with family 
and friends. The web server and the user interface would be such that 
expertise in making web pages would not be required. It is expected 
that the basics of internet usage, emailing, on-line chatting etc. 
would become part of school programs. By including encryption in both 
the email and web contents, the leak of contents to eavesdroppers 
is avoided. 

The above development would mean that the not-really free 
services offered by the tech mammoths would not be required. So 
instead of waiting for a political solution from bent politicians, we 
are proposing a technical solution which would be created and 
maintained by the volunteer open source community and financed 
by required contributions from corporations making devices or 
software and donations by users. 


6 CONCLUSION 


The current practice of tech giants can only be neutralized by a 
technical solution under which their service would not be required. 
Once each family and business has its own static IP address and 
a hardened connection to the internet, with a server that provides 
email and web service, it becomes independent. The web server 
would not allow any robots and the gatekeeper, Heimdaller, would 
not allow any untrusted/uncertified software to be installed in the 
system. The development of such a system is the next challenge for 
the academic community! 
Appendix: The web from an early participant 


Soon after the so-called official inauguration of the world wide 
web (WWW) in the form of the first WWW meeting in Geneva, 
a flurry of activities was held in the U.S.A. This included a rushed 
announcement by the National Center for Supercomputing 
Applications (NCSA) of the ‘Mosaic and the Web’ conference, which 
was renamed WWW II and was held in Chicago. Whereas the first 
was announced by Robert Cailliau, the second was spearheaded 
by NCSA Mosaic[78]. One of the early resolutions of these two 
meetings was raised in the Navigational and Priority workshops 
held during the first world wide web meeting (WWW-1) in Geneva 
in 1994. Another activity in the first days of the web was a forum 
held in July 1995 by the USA National Science and Technology 
Council’s Committee on Information and Communication in the Lister 
Hill Center (Bethesda, MD), entitled America in the Age of 
Information. A number of white papers were presented[8]; looking 
through the list one finds that none of the white papers touched on 
the issue of privacy. There was but one presentation on security. 

During the subsequent early WWW meetings, some of the people 
involved in the navigation priority workshop devised various 
mechanisms for search in the new web. These included the 
WebJournal[10], the support of robots and, soon thereafter, the early 
search engines. During WWW III, in Darmstadt, the pioneers of the 
early search engines felt that, to provide for the financial needs of 
the search engines, a side panel to display paid publicity would be 
appropriate. This way the paid publicity would be separated from 
the search results. This was the method used until a late arrival: 
initially, this new system was idealistic but it soon became, under 
pressure from the venture capitalists, one of the leaders of what 
have been termed the digital gangsters. All these systems are based 
on collecting huge amounts of personal information about the users, 
be it from free emails or postings made on one of the online social 
networks (OSN). 
ABSTRACT


The field of artificial intelligence (AI) is constantly growing and 
finding new ways to solve real world problems. One of the AI 
knowledge and research fields is natural language processing (NLP) 
which attempts to categorise and process human language data in 
an effort to utilise machines to understand humans. 


Among the most used applications of NLP is sentiment analysis.
One reason is that sentiment analysis is about understanding how
humans feel about an action or event, which could give companies
with an online presence the power to understand the opinions of
their customers online.


A commonly used weighting factor for representing text in
sentiment analysis is tfidf. In this study we compare several
supervised learning methods using tfidf values in order to identify
the most accurate model for analysing sentiment.


After determining the best classifier based on metrics and other
parameters, we apply it to a real sentiment analysis task with
Twitter data.
1 INTRODUCTION


The field of artificial intelligence (AI) is constantly growing and
finding new ways to solve real world problems [19]. One of the AI
knowledge and research fields is natural language processing (NLP),
which attempts to categorise and process human language data in
an effort to utilise machines to understand humans.


Natural Language Processing is an artificial intelligence field
focused on enabling computers to understand, process and act on
human languages, bringing computers closer to a human-level
understanding of language [14]. Advances in machine learning have
enabled computers to do many useful things using NLP techniques
and deep learning [23], such as online language translation or
semantic understanding [7].


One of the most popular and important uses of NLP is sentiment
analysis [20]. With this technique we can build systems that
attempt to identify and extract opinions or sentiments from speech
or written texts [2]. This type of analysis is very important for
companies because they can collect data from customers' opinions
and use them to improve their businesses.


Since computers cannot process human expressions directly, we
first need to create a binary vector in which each word has its own
position; this step is called vectorization.
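

The vectorization step can be illustrated with a minimal sketch. The
function names and the whitespace tokenization below are assumptions
made for illustration; they are not taken from the study.

```python
def build_vocabulary(documents):
    """Assign each distinct lower-cased word a fixed position."""
    vocabulary = {}
    for text in documents:
        for word in text.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def vectorize(text, vocabulary):
    """Return a binary vector with a 1 at the position of every known word in text."""
    vector = [0] * len(vocabulary)
    for word in text.lower().split():
        if word in vocabulary:
            vector[vocabulary[word]] = 1
    return vector


# Hypothetical toy corpus; positions follow first appearance:
# crust=0, is=1, not=2, good=3, the=4, selection=5, was=6, great=7
docs = ["crust is not good", "the selection was great"]
vocab = build_vocabulary(docs)
print(vectorize("the crust was great", vocab))  # [1, 0, 0, 0, 1, 0, 1, 1]
```

Words that never occurred in the corpus used to build the vocabulary
have no position and are simply ignored in this sketch.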


Once the text has been vectorized, a term frequency-inverse
document frequency (tfidf) weighting factor can be applied [5]. The
resulting vector, based on word frequencies, will be the input of
all the models we will train. The weight is calculated as the
product of the term frequency and the inverse document frequency
[13]. Let m be the word and l the document, and let f(m, l) be the
number of times that m appears in l. The term frequency tf(m, l) is
this count divided by the count of the most frequent word in l, as
we can see in Equation 1.


$$ tf(m, l) = \frac{f(m, l)}{\max\{f(m', l) : m' \in l\}} \tag{1} $$


The term idf refers to the inverse document frequency, which
indicates whether the word is common across a set of documents.
This value, as we can observe in Equation 2, is obtained by
dividing the total number of documents in the set L by the number
of documents that contain the word, and then taking the logarithm
[16].


$$ idf(m, L) = \log \frac{|L|}{|\{l \in L : m \in l\}|} \tag{2} $$


The final value is calculated as the product of both, as we can
see in Equation 3 [22].


$$ tfidf(m, l, L) = tf(m, l) \times idf(m, L) \tag{3} $$


A high tfidf value signifies a high frequency of the word in the
document and a low frequency of the word across the set of
documents [9].
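

As a worked illustration of Equations 1 to 3, the following sketch
computes the three quantities for a small corpus. The toy documents and
function names are hypothetical, and the natural logarithm is assumed
because the text does not state a base.

```python
import math
from collections import Counter


def tf(word, doc):
    """Equation 1: count of word in doc divided by the count of doc's most frequent word."""
    counts = Counter(doc)
    return counts[word] / max(counts.values())


def idf(word, corpus):
    """Equation 2: log of (number of documents / number of documents containing word).

    The word is assumed to occur in at least one document; otherwise
    the division below would fail.
    """
    containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / containing)


def tfidf(word, doc, corpus):
    """Equation 3: product of term frequency and inverse document frequency."""
    return tf(word, doc) * idf(word, corpus)


# Hypothetical corpus of three tokenized documents.
corpus = [
    "good case excellent value".split(),
    "good food good place".split(),
    "waste of money".split(),
]

# "good" is frequent inside the second document but also common in the corpus;
# "waste" appears in a single document, so it receives the higher weight.
print(tfidf("good", corpus[1], corpus))
print(tfidf("waste", corpus[2], corpus))
```

The comparison shows the behaviour described above: a word that is
frequent in one document but rare across the corpus is weighted more
heavily than a word that appears in most documents.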
In addition to NLP, in the broad world of Artificial Intelligence
we find its application to data science. Data science fundamentally
consists of processing, analysing and creating models in order to
extract information from data. Deeper within this field, machine
learning is responsible for training computers to learn from data.
Machine learning methods can be divided into supervised,
unsupervised and reinforcement learning.


Supervised learning requires that the machine be trained with
previously labelled data in order to obtain a model that can be
applied to new input data. In contrast, unsupervised learning does
not require labelled training data; it groups data into k groups in
order to build profiles or to join data with similar behavior.
Finally, reinforcement learning is based on rewards and penalties.


The present study aims to identify the most accurate supervised
learning method for sentiment analysis. Performing this type of
analysis is very useful for companies, as it allows them to know
consumers' opinions and thus use this information to improve
products, departments or marketing strategies, among others. In
this article we will analyze texts with different supervised
learning methods, as well as an artificial neural network (ANN), in
order to compare them and identify the precision of each technique
[3].


The rest of the paper is structured as follows. Section 2 defines
the data used in this experiment. Section 3 describes the research
methods used, as well as the metrics used to compare them, and
reports the metrics obtained by each method. Section 4 presents the
main results of the experiment, discusses them, and tests the best
model with new Twitter data. Finally, Section 5 presents the
conclusions that can be drawn from this experiment.


2 DATASETS 


The data used for the present study belong to the corpora named
yelp labelled and amazon cells labelled [15]. Both datasets have,
as can be seen in Table 1, two columns called Tweets and Labels.


Table 1: Datasets [6]


Yelp Labelled
Tweets                                        Labels
Wow... Loved this place.                      1
Crust is not good.                            0
The selection on the menu was great.          1

Amazon Cells Labelled
Tweets                                        Labels
There is no way for me to plug it in here.    0
Good case, Excellent value.                   1
What a waste of money and time!.              0


Since each corpus has 1000 lines of correctly labelled data, we
will use one corpus for training and the second corpus for
validation. Validation with $x_{test}$ will provide us with
$y_{pred}$. Since we already have the correct labels in the
validation set, we will compare $y_{pred}$ and $y_{test}$ in order
to assess whether the classification is correct.


Table 2: Twitter Real Application


Tweets
Have a great day as well! We have many more promos planned. Stay tuned!
I would be more than happy to further investigate this transaction.
That’s AMAZING! Congrats on the 20th year. We greatly appreciate you as a member.


Once we have trained the models with data from Table 1, we will
perform a quick test using the best classifier with tweets
downloaded from the Twitter REST API. In this case we will not have
labels for each instance, as we can see in Table 2, which is what
really happens in companies.


3 RESEARCH DISCUSSION 


Supervised learning methods are algorithms which base their
learning process on an accurately labeled training data set [4].
This means that for each occurrence in the training data set, we
know the value of its target variable. Their use is limited to
classification or regression [17]. In this article we will apply
this kind of method to sentiment analysis so that we can see which
of them is the most accurate.


First of all, the data will be vectorized as described in the
introduction and a tfidf weighting factor will be applied; thus,
the inputs of each classifier will be based on the frequency of
each term.


After that, since we have two correctly labeled corpora, we will
use one of them to perform the training process and the other one
to test. The validation labels will be temporarily removed so that,
once we have the predictions from the trained model, we can compare
them with the real labels and thus count the correct predictions.


We will apply several supervised methods so that we can see the
differences between them, taking into account metrics such as the
confusion matrix and the Area Under Curve (AUC) [8]. The confusion
matrix is a visualization of how accurate the model is, and it is
composed of four measures, as we can see in Figure 1.


Figure 1: Confusion Matrix

                                   Actual Values
                             Positive (1)   Negative (0)
Predicted    Positive (1)        TP             FP
Values       Negative (0)        FN             TN


Here TP (True Positives) are data correctly predicted as positive;
FP (False Positives) are data predicted as positive that are
actually negative; FN (False Negatives) are data predicted as
negative that are actually positive; and TN (True Negatives) are
data correctly predicted as negative.
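

The four cells of the confusion matrix can be counted directly from the
true and predicted labels. The following is a minimal sketch; the
function name and the six example labels are hypothetical and only
illustrate the definitions of TP, FP, FN and TN.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN and TN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # predicted positive, actually positive
        elif predicted == 1 and actual == 0:
            fp += 1  # predicted positive, actually negative
        elif predicted == 0 and actual == 1:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical labels for six validation instances.
y_test = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_test, y_pred))  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```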


Using the confusion matrix terms we can calculate other metrics
that we will take into account to compare our models, such as
Recall or True Positive Rate (TPR), which measures the rate of true
positives. It is calculated as follows:


$$ Recall = TPR = \frac{TP}{TP + FN} \tag{4} $$


Precision is another metric, which measures the proportion of
positive predictions that are correct, and is calculated as
follows:


$$ Precision = \frac{TP}{TP + FP} \tag{5} $$


Normally precision and recall are combined in a metric called F1,
which can be calculated with the following equation:


$$ F1 = 2 \cdot \frac{Recall \cdot Precision}{Recall + Precision} \tag{6} $$


Another parameter we need in order to calculate AUC is the False
Positive Rate (FPR), which we can calculate by applying the next
expression:


$$ FPR = \frac{FP}{FP + TN} \tag{7} $$


The other metric we are going to take into account is the Area
Under Curve (AUC), which measures the performance of the model. The
curve is created by plotting the True Positive Rate (TPR) against
the False Positive Rate (FPR), and the area is calculated as
follows [12]:


$$ AUC = \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right) \tag{8} $$


Once we have calculated the metrics for each model and identified
the best one, we will download tweets from the Twitter REST API to
classify. In this case the instances will not be labelled, as is
usual in companies, so we will use the best classifier according to
the metrics calculated before.
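

The metrics in Equations 4 to 8 follow directly from the four confusion
matrix counts. The sketch below is illustrative (the function name is
an assumption) and uses the counts reported for Naive Bayes in this
study: TP = 355, FN = 105, TN = 395 and FP = 145.

```python
def metrics(tp, fn, tn, fp):
    """Compute the metrics of Equations 4 to 8 from confusion matrix counts."""
    recall = tp / (tp + fn)                             # Equation 4
    precision = tp / (tp + fp)                          # Equation 5
    f1 = 2 * recall * precision / (recall + precision)  # Equation 6
    fpr = fp / (fp + tn)                                # Equation 7
    auc = 0.5 * (tp / (tp + fn) + tn / (tn + fp))       # Equation 8
    return {"recall": recall, "precision": precision,
            "f1": f1, "fpr": fpr, "auc": auc}


# Counts reported for the Naive Bayes classifier.
nb = metrics(tp=355, fn=105, tn=395, fp=145)
for name, value in nb.items():
    print(f"{name}: {value:.3f}")
```

With these counts the recall is approximately 0.77, the precision 0.71,
the F1 approximately 0.74 and the AUC approximately 0.75, in line with
the values reported for Naive Bayes.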


In the following subsections we will see all the supervised learn- 
ing methods that we use for this experiment, seeing for each method 
the AUC and confusion matrix metrics. 


3.1 Naive Bayes 


Naive Bayes is a probabilistic classifier based on the application
of Bayes' theorem and the hypothesis of independence between the
predictive variables. This classifier assumes that the presence of
a particular feature is not related to the presence or absence of
any other feature; it considers that every single feature
contributes independently to the target probability [10].
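

The independence assumption can be made concrete with a minimal
multinomial Naive Bayes sketch. This is not the implementation used in
the study: it works on raw word counts rather than tfidf values, uses
add-one (Laplace) smoothing, and the function names and the four
training sentences are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(texts, labels):
    """Collect per-class word counts, class counts and the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocabulary = set()
    for text, label in zip(texts, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocabulary.update(words)
    return word_counts, class_counts, vocabulary


def predict(text, model):
    """Return the class with the highest log posterior score."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocabulary:
                # Each word contributes independently (add-one smoothing).
                likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data in the style of Table 1 (1 = positive, 0 = negative).
train_texts = ["loved this place", "crust is not good",
               "the selection was great", "what a waste"]
train_labels = [1, 0, 1, 0]
model = train_naive_bayes(train_texts, train_labels)

print(predict("great place", model))     # 1
print(predict("waste of money", model))  # 0
```

Because the per-word log likelihoods are simply summed, the score of a
sentence does not depend on word order or on interactions between
words, which is exactly the independence hypothesis described above.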


Evaluating the model with the confusion matrix, we can observe
that in 750 cases the prediction was correct: 355 were correctly
predicted as positive and 395 were correctly predicted as negative.
On the other hand, in 250 cases the prediction was incorrect: 145
were incorrectly predicted as positive and 105 were incorrectly
predicted as negative.


Figure 2: Naive Bayes Metrics


As we can see in Figure 2, this classifier reaches an AUC of 0.75.
It can be observed in the confusion matrix that there are more
failures in the negative results than in the positive ones.


3.2 Support Vector Machine


Support Vector Machine (SVM) is a kind of supervised learning
method which creates models that represent the samples as points in
space, separating the classes by a gap that is as wide as possible
[1].


The separating boundary is a hyperplane determined by the points
of each class that lie closest to it, which are called support
vectors. New samples are mapped into the same space and assigned to
one of the classes according to the side of the hyperplane on which
they fall. SVMs have the ability to construct a hyperplane or set
of hyperplanes in a high dimensional space [10].


Figure 3: Support Vector Machine Metrics


Once we have trained our model, we can see that in 733 cases the
prediction was correct: 346 were correctly predicted as positive
and 387 were correctly predicted as negative. On the other hand, in
267 cases the prediction was incorrect: 154 were incorrectly
predicted as positive and 113 were incorrectly predicted as
negative.


It can be seen in Figure 3 that the AUC reached by the support
vector machine algorithm is 0.73, and that it classifies the
positive data better than the negative data.
3.3 Logistic Regression


Logistic Regression models have become an accepted method for
modelling binary outcome variables [11]. Logistic regression is a
type of regression analysis used to predict the outcome of a
categorical variable based on predictor variables. The main idea of
Logistic Regression is that it approximates the probability of
obtaining 1 or 0 given the value of the explanatory variable x.
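

The way a logistic regression model turns an input into a probability
can be sketched as follows. The weights and bias below are hypothetical
values chosen for illustration, not parameters fitted in this study.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of class 1 for one feature vector under a logistic model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Hypothetical model with two features: the first pushes towards class 1,
# the second towards class 0.
weights = [2.0, -3.0]
bias = 0.0

p_positive = predict_proba([1.0, 0.0], weights, bias)
p_negative = predict_proba([0.0, 1.0], weights, bias)
print(p_positive > 0.5, p_negative > 0.5)  # True False
```

A threshold of 0.5 on the returned probability then yields the binary
label.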


After training the model, we find that in 741 cases the prediction
was correct: 351 were correctly predicted as positive and 390 were
correctly predicted as negative. On the other hand, in 259 cases
the prediction was incorrect: 149 were incorrectly predicted as
positive and 110 were incorrectly predicted as negative.


Figure 4: Logistic Regression Metrics


As we can observe in Figure 4, the model reaches an AUC of 0.74
and separates the positive and negative data with the same quality
as the previous model.


3.4 Decision Tree


A Decision Tree is an artificial intelligence algorithm based on
tree diagrams. It consists of representing and categorizing a
series of conditions that occur successively [18].


Figure 5: Decision Tree Metrics


We can see that in 668 cases the prediction was correct: 184 were
correctly predicted as positive and 484 were correctly predicted as
negative. On the other hand, in 332 cases the prediction was
incorrect: 316 were incorrectly predicted as positive and 16 were
incorrectly predicted as negative.


It can be seen that the AUC is 0.76. However, we see that it
classifies the majority of the data as negative, so it has a
serious defect in the classification of positives, despite the high
value of AUC.
3.5 Random Forest


A Random Forest algorithm consists of a combination of supervised
predictive trees. Each tree depends on the values of a random
vector sampled independently and with the same distribution for all
the trees.


The model yields a correct prediction in 696 cases: 304 were
correctly predicted as positive and 392 were correctly predicted as
negative. On the other hand, in 304 cases the prediction was
incorrect: 196 were incorrectly predicted as positive and 108 were
incorrectly predicted as negative.


Figure 6: Random Forest Metrics


We see that the AUC is 0.70. This result is lower than those of
all the previous algorithms, which can also be seen in the
confusion matrix.


3.6 Perceptron 

Perceptron is a supervised learning algorithm for binary
classification. This classifier makes its predictions based on a
linear predictor function that combines a set of weights with the
feature vector. It creates a hyperplane, so if the training set Z
is not linearly separable it will not separate both classes
correctly.
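

The limitation to linearly separable data can be illustrated with a
minimal perceptron sketch. The training loop below uses the classic
perceptron update rule; the function names, the number of epochs and
the AND/XOR toy problems are assumptions made for illustration and are
unrelated to the sentiment datasets.

```python
def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Train a perceptron with the classic error-driven update rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if activation > 0 else 0
            error = target - predicted  # 0 when the prediction is correct
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


def classify(x, weights, bias):
    """Predict 1 if x lies on the positive side of the hyperplane, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


points = [[0, 0], [0, 1], [1, 0], [1, 1]]
and_labels = [0, 0, 0, 1]  # linearly separable
xor_labels = [0, 1, 1, 0]  # not linearly separable

for name, labels in (("AND", and_labels), ("XOR", xor_labels)):
    w, b = train_perceptron(points, labels)
    correct = sum(classify(p, w, b) == y for p, y in zip(points, labels))
    print(name, "correct:", correct, "of", len(points))
```

The AND problem is learned perfectly, whereas no single hyperplane can
classify all four XOR points, so at least one of them is always
misclassified.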


Figure 7: Perceptron Metrics


After training the model, we find that in 736 cases the prediction
was correct: 392 were correctly predicted as positive and 344 were
correctly predicted as negative. On the other hand, in 264 cases
the prediction was incorrect: 108 were incorrectly predicted as
positive and 156 were incorrectly predicted as negative.


As we can see in Figure 7, the model reaches an AUC of 0.74 and
separates the positive and negative data with comparable quality.
3.7 Multilayer Perceptron (MLP)


A Multilayer Perceptron (MLP) is an artificial neural network
formed by multiple layers, in such a way that it has the capacity
to solve problems that are not linearly separable. This provides a
solution to the problem found in the perceptron, as we saw in the
previous section [21].

In this case, since the neural network training process is harder
than that of the methods used before, we will train the MLP with
both datasets, using 10% of all the data for the validation
process.


Figure 8: Multilayer Perceptron Metrics


The MLP has an input layer through which the data enter, and three
hidden layers connected to each other with a non-linear activation
function, in this case sigmoid, in order to find non-linear
patterns. Finally, we find the output layer, whose activation
function is also sigmoid.
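

The forward pass of such a network can be sketched in a few lines. The
layer widths, the random initial weights and the input vector below are
hypothetical; the text only fixes the structure (three hidden layers
and one output layer, all with sigmoid activations), and no training is
performed in this sketch.

```python
import math
import random


def sigmoid(z):
    """Sigmoid activation used in every layer."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer followed by the sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random weights in [-1, 1] and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(inputs, layers):
    """Propagate the input through all layers in order."""
    activation = inputs
    for weights, biases in layers:
        activation = dense(activation, weights, biases)
    return activation


rng = random.Random(0)
sizes = [5, 4, 4, 4, 1]  # input, three hidden layers, output (hypothetical widths)
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

# A hypothetical tfidf-like input vector with five features.
output = forward([0.0, 0.3, 0.0, 0.7, 0.1], layers)
print(output)  # a single value strictly between 0 and 1
```

Because the output activation is a sigmoid, the single output value can
be read as the predicted probability of the positive class.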


As we can see in Figure 8, the network reaches an AUC of 0.84. It
can be seen in the confusion matrix that this algorithm has
effectively classified both the positive and negative data.


4 RESULTS AND DISCUSSION 


Reviewing only the AUC metric in Table 3 after having tested
several supervised learning methods, the highest result apart from
the MLP comes from the Decision Tree, with an AUC of 0.76. However,
with the lowest precision, 0.38, and an F1 of 0.538, it is a
classifier that has many failures in the positives.


Table 3: Metrics per Classifier


Classifier            F1      Recall   Precision   AUC
Naive Bayes           0.739   0.771    0.71        0.75
SVM                   0.721   0.753    0.69        0.73
Logistic Regression   0.730   0.761    0.70        0.74
Decision Tree         0.538   0.909    0.38        0.76
Random Forest         0.666   0.737    0.61        0.70
Perceptron            0.748   0.715    0.78        0.74
MLP                   0.857   0.869    0.85        0.84


On the other hand, Naive Bayes is a classifier with an AUC of
0.75. This differs from the Decision Tree by only 0.01, but Naive
Bayes classifies the negative and positive data in a more balanced
way. As we can see in Table 4, the Decision Tree has 316 false
positives and Naive Bayes only 145, which means that the Decision
Tree cannot properly predict positive instances. As can be seen in
Figure 9 and also in Table 4, Naive Bayes is the classifier with
the most correct predictions, 750, versus 741 for the second one,
Logistic Regression.


The classifier with the worst AUC value is Random Forest, with an
AUC of 0.70, but the Decision Tree is the worst method in terms of
the number of correct predictions, 668. On the other hand, the MLP
provides us with an AUC of 0.84, which is a great advantage over
the other machine learning methods.


Table 4: Confusion Matrix per Classifier


Classifier            Correct   Incorrect   TP    FN    TN    FP
Naive Bayes           750       250         355   105   395   145
SVM                   733       267         346   113   387   154
Logistic Regression   741       259         351   110   390   149
Decision Tree         668       332         184   16    484   316
Random Forest         696       304         304   108   392   196
Perceptron            736       264         392   156   344   108


Although the best AUC comes from the Multilayer Perceptron, we
also take into account other parameters such as the training time
and the fitting of optimal hyperparameters; based on all these
factors, we chose Naive Bayes as the best classifier.


Having compared the main supervised machine learning methods
tested in this experiment, and based on the results obtained, which
can be seen in Figure 9, we performed a quick test using the best
classifier with tweets downloaded from the Twitter REST API. In
this case, the tested algorithm was Naive Bayes.


Table 5: Twitter Real Application


Tweets                                                 Sentiment Predicted Labels
Have a great day as well! We have many more promos
planned. Stay tuned!                                   1
I would be more than happy to further investigate
this transaction.                                      0
That's AMAZING! Congrats on the 20th year. We
greatly appreciate you as a member.                    1


As we can observe in Table 5, the tweets appear to be classified
reasonably. Since we have no ground truth for these tweets and
therefore no metric, we can only observe that the Naive Bayes model
trained in this study works with new inputs.
Figure 9: Results


5 CONCLUSIONS


Sentiment analysis is about understanding how humans feel about an
action, which could give companies with an online presence the
power to understand the opinions of their customers online.


In this study we compared several supervised learning methods
based on word frequencies in order to identify the most accurate
model for analysing sentiment.


In Table 3, we can observe the main metrics which measure each
model. We can see that the lowest AUC corresponds to Random Forest
and the highest to the Multilayer Perceptron. Also, it can be
observed that the Decision Tree has an AUC of 0.76, but when we
look at the other metrics we can see that its precision is the
weakest.


In conclusion, we can determine that the best supervised machine
learning classifiers for carrying out this kind of sentiment
analysis study are Naive Bayes and Perceptron. Neural networks,
although their processing, training and hyperparameter fitting are
more computationally expensive, provide a precise binary
prediction.
ABSTRACT


In this paper, we build on top of the MalConv neural networks learn- 
ing architecture which was initially designed for malware/benign 
classification. We evaluate the transfer learning of MalConv for 
malware multi-class classification by extending its contribution 
in several directions: (1) We assess MalConv performance on a 
multi-classification problem using a new dataset composed of solely 
malware samples belonging to different malware families, (2) we 
evaluate MalConv on the raw bytes data as well as on the opcodes 
extracted from the reversed assembly samples and compare the 
results, (3) we validate the MalConv findings about regularization, 
and (4) we study MalConv performance when using a medium 
size dataset and limited computational resources and GPU. The 
obtained results show that MalConv performs equally well for 
multi-classification and its performance on raw byte sequences is 
comparable to opcodes sequences. DeCov regularization is shown 
to improve the accuracy results better than other regularization 
techniques. 


CCS CONCEPTS 


• Security and privacy → Malware and its mitigation; • Computing
methodologies → Machine learning; Neural networks;


KEYWORDS 


Transfer Learning, Malware, Classification, Regularization, Deep 
Learning 


ACM Reference Format: 

Mohamad Al Kadri, Mohamed Nassar, Haidar Safa. 2019. Transfer Learning 
for Malware Multi-Classification . In 23rd International Database Engineering 
& Applications Symposium (IDEAS’19), June 10-12, 2019, Athens, Greece. 
ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3331076.3331111 


1 INTRODUCTION 


Malware data are fundamentally different from text and image data.
For instance, a byte in a malware executable has different meanings
depending on its location in the code and its context. It can be an
instruction, a part of an instruction, a part of an address, an
argument, a data item, etc. In contrast, a byte in an image or a
video always represents the pixel intensity or the color code. A
byte in a text sequence would always represent a letter. Therefore,
dealing with malware data, executables, or software in general
requires different machine learning techniques and artifacts. This
same idea has been highlighted in [25] and it is part of the
motivation behind much research work on the topic of malware
detection and classification using deep learning.

Raff et al. proposed MalConv, a convolutional neural network
architecture for malware detection that consumes a whole executable
("eating a whole EXE") [18]. Previous machine learning work was
based to a large extent on the quality of features extracted from
the raw byte data. As a major shift, deep learning nowadays claims
that it can automatically discover and encode relevant features
without resorting to any tedious manual engineering and expert
selection of features. Still, good architecture design is required
to achieve high performance. MalConv [18] has been specially
designed as a shallow model to deal with very long sequences (often
millions of bytes). Deep learning architectures for sequences such
as Recurrent Neural Networks (RNN) [23] and Long Short-Term Memory
(LSTM) [9] usually deal with very short sequences representing
sentences, which makes it possible to go deeper with many layers.
In MalConv, the whole malware byte code is represented as one
sequence. The large length of the sequences requires a design with
a moderate number of layers to meet the available computational and
memory resources.

In machine learning, transfer learning is to take knowledge the 
neural network has learned from one task and apply that knowledge 
as the starting point for training a model on a separate task. Transfer 
learning is usually successful when low-level features from the first 
task could help learn the second task. Two examples are learning to 
classify radiology images based on an object recognition classifier, 
or learning to drive quadcopters based on a self-driving car model. 
Transfer learning has been most useful where relatively little data 
are available to train a model for the target task. 

In this paper, we propose transfer learning of the MalConv archi- 
tecture and experiment with a new dataset. MalConv was initially 
designed for malware/benign classification and has not been tested 
in multi-class settings. We evaluate MalConv for malware multi- 
class classification and extend its contribution in several directions: 


• We assess MalConv performance on a multi-classification
problem using a new dataset composed solely of malware
samples belonging to different malware families.

• We evaluate MalConv on the raw bytes data as well as on
the opcodes extracted from the reassembled samples and
compare the results.

• We validate the MalConv findings about regularization,
especially that DeCov [4] is a much better regularizer than batch
normalization [10]. The reason is that, as we will show, the
activation distributions are multi-modal and far from fitting
a Gaussian distribution.

• We study the performance of MalConv with a medium size
dataset and limited computational resources and GPU. We
evaluate the technical limits of operating on a moderate GPU
and a single off-the-shelf machine.


Note that since we do not have the original MalConv pre-trained
model, we could not experiment with only replacing and re-training
the last one or two layers. Instead, we have trained all the layers
of MalConv using several parameterizations. We consider transfer
learning because our dataset is very small compared to the datasets
used to train the original MalConv. To our knowledge, this is the
first work addressing transfer learning from malware/benign
classification to malware multi-classification.

Our results show that MalConv performs equally well for
multi-classification and that its performance on raw byte sequences
is comparable to that on opcode sequences. DeCov regularization
improves the accuracy results more than other regularization
techniques.

The rest of this paper is organized as follows: We discuss related
work in Section 2. In Section 3 we present the MalConv architecture
for multi-classification. In Section 4 we introduce the dataset,
perform experiments and discuss the obtained results. Section 5
concludes the paper and discusses future work.


2 RELATED WORK 


Malware is a major threat to the Internet of today. Malware data are 
fundamentally different than text and images [25]. Accuracy num- 
bers on malware clustering and classification are not representative 
and sometimes misleading. In [14] six commercial anti-viruses are 
shown to be biased and not better than a simple plagiarism de- 
tection algorithm. This bias is mainly due to unbalanced datasets 
where most malware instances are easy to classify. Machine learn- 
ing algorithms in general and neural networks, in particular, are 
proposed as a way to cope with these challenges. 

Neural networks have been around for decades. They resurged 
thanks to the unprecedented data availability and computational 
scale, and have achieved major breakthroughs in many domains 
such as games, visual object recognition, language modeling, and 
speech recognition. Deep learning is informally known as a set of 
recent neural network designs such as word embedding, convolu- 
tional networks, and recurrent/recursive networks. Deep learning 
has been proposed to enhance the two branches of malware analy- 
sis, namely static analysis, and dynamic analysis. 

Static analysis is about extracting syntactical and semantical 
features from the binary or the disassembled malware using tools 
such as IDA [6]. Its main challenge is code obfuscation, metamor- 
phic and polymorphic malware. Static analysis can be reinforced 
by deep learning as proposed in [22]. In that paper, the identifica- 
tion of function starts and ends in the binary code is addressed. 
Experiments with recurrent neural networks show that functions 
in binaries can be identified with greater accuracy and efficiency 
than many other machine learning algorithms. 

Dynamic analysis runs the malware in a sandbox such as Cuckoo
sandbox [21] and monitors its activities such as system calls and
file access patterns. Dynamic analysis is known to be time and
resource consuming. Moreover, its main challenge is that malware
can detect the surrounding environment and remain dormant [7].
Dynamic analysis can be reinforced by deep learning as proposed in
[13]. This approach suggests classifying malware samples based on
their system call sequences. Experiments were performed with
one-dimensional convolutional networks taking sequences of system
calls in the form of a set of n-grams. However, the drawback of
convolutional networks is that they do not explicitly model the
sequential position of system calls. On the other hand, recurrent
networks train a stateful model by using full sequential
information. The drawback is that RNNs are more complex and more
difficult to train. The authors show that by combining those two
layers within the same hierarchy, malware detection capabilities
are increased.
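
To make the notion of n-grams over system calls concrete, the following
minimal sketch extracts all contiguous n-grams from a call trace. The
function name and the four-call trace are hypothetical and do not
reproduce the feature extraction of [13].

```python
def ngrams(sequence, n):
    """Return all contiguous n-grams of a sequence as tuples, in order."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


# Hypothetical system call trace recorded in a sandbox.
calls = ["open", "read", "write", "close"]

print(ngrams(calls, 2))
# [('open', 'read'), ('read', 'write'), ('write', 'close')]
```

Each n-gram captures only a local window of the trace, which is why a
set of n-grams does not by itself encode where in the full sequence a
pattern occurred.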

The malware dataset [20] used in this work has also been used in
related work such as [8]. In [8], two approaches
were proposed. The first approach represents malware samples as 32
x 32 gray-scale images and inputs them to a convolutional network
with max pooling followed by fully connected layers. The
idea of representing malware as an image originally appeared in [17].
This approach suggests transforming the binary into a vector of
8-bit integers, which can be reshaped into a matrix and therefore
viewed as a gray-scale image. However, as discussed in MalConv
[18], the receptive field of a convolution over such an image covers discontinuous
sequences in the original malware. This fact suggests that using one- 
dimensional convolutional networks is more valuable. The second 
approach in [8] recognizes this fact and uses a scheme that initially 
appeared in [12]. The initial approach deploys a convolutional 
layer with multiple filter widths and feature maps on top of word 
vectors obtained using Word2Vec [15]. In this paper, we report
comparable performance based on transfer learning of MalConv.
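
The image representation discussed above (a vector of 8-bit integers reshaped into a matrix) can be sketched as follows. The row width of 32 and the zero padding of the last row are assumptions made for this illustration; how [8] and [17] resize or pad the images is not described here.

```python
def bytes_to_image(data: bytes, width: int = 32):
    """Reshape a byte string into rows of `width` pixel values in 0..255.

    The last row is zero-padded; the padding policy is an assumption of this sketch.
    """
    pixels = list(data)                    # each byte becomes one 8-bit pixel
    pad = (-len(pixels)) % width           # number of zeros needed to fill the last row
    pixels.extend([0] * pad)
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


# Toy input: 70 consecutive byte values stand in for a malware binary.
image = bytes_to_image(bytes(range(70)), width=32)
```

In this layout, vertically adjacent pixels such as `image[0][0]` and `image[1][0]` are `width` bytes apart in the original file, which illustrates why a two-dimensional receptive field mixes bytes that are not contiguous.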

A subset of the authors has previously proposed modeling mal- 
ware as a language and experimented with a document-distance 
approach in [2]. Our approach showed promising preliminary re- 
sults, but it still requires computational performance improvement. 
We also experimented with using t-SNE to project malware families
into 2D or 3D for visualization purposes based on n-gram features
[16].

Few works address transfer learning for malicious software
classification. In [19], a malware family classification approach is
presented based on the ResNet-50 architecture, which is a deep
learning model for image classification. Accordingly, the malware
samples were represented as grayscale byte images. However,
representing malware as a grayscale image is lossy in terms of sequential
information and has been criticized in the literature. Transfer learning
is also used in a limited way in [11]. In that work, a generative 
adversarial network (GAN) composed of a generator and a discrim- 
inator is proposed. The generator learns to produce fake malware 
samples and make them indistinguishable from real ones. The goal 
is to generate samples which are most similar to zero-day attacks. 
Transfer learning was needed to stabilize the generator based on a 
pre-trained auto-encoder of malware characteristics. The discrimi- 
nator is supposed to detect zero-day attacks efficiently by learning 
to separate fake malware samples from real ones. 


3 MALCONV ARCHITECTURE 


Figure 1: High-Level Diagram of the MalConv Architecture


Table 1: Distribution of samples among the families

Class  Family          Type                Nb. of Instances  Percentage (%)
1      Ramnit          Worm                1541              14.20
2      Lollipop        Adware              2478              22.80
3      Kelihos ver 3   Backdoor            2942              27.07
4      Vundo           Trojan              475               4.37
5      Simda           Backdoor            42                0.39
6      Tracur          Trojan Downloader   751               6.91
7      Kelihos ver 1   Backdoor            398               3.66
8      Obfuscator.ACY  Obfuscated malware  1228              11.30
9      Gatak           Backdoor            1013              9.32


MalConv [18] was originally designed for the task of malware/benign
classification on raw byte data. It has outperformed many other
architectures, including deep convolutional networks and various RNNs
with different attention models. MalConv is based on the idea of gated
convolutional networks that was originally proposed in [5]. The
MalConv design is mainly driven by the following principles:


- Preserving a high level of generalization to previously unseen
  samples. This constraint has ruled out approaches that
  do not have convolution filters at all, since big parts of the
  executable may look benign while a small part somewhere
  in the file is malicious.

- Dealing with very large sequences. This constraint has limited
  the number of convolutional layers in order to stay within
  the available memory.

- Dealing with information sparsity by applying max-pooling
  instead of average-pooling. Average-pooling may lead to the
  loss of sparse features that have strong responses.

- Avoiding RNNs on top of CNN layers. RNNs seem to overfit
  the sequencing patterns at the output of the CNN and do
  not generalize well.


In this paper, we propose transfer learning of MalConv to fit the
multi-class malware classification setting, as shown in Figure 1. The
network starts by feeding the bytes to an embedding layer (a trainable
lookup table with eight output dimensions). Otherwise, bytes would be
considered similar if they have close numerical values, which is
wrong. The embedding layer output is fed into two 1-D convolutional
layers in parallel; only one of them has a non-linear (sigmoid)
activation. The convolutional layers have 128 filters (in depth) with
a rather large filter width of 500 bytes combined with an aggressive
stride of 500 bytes. The outputs of the two layers are multiplied
elementwise and optionally passed to a rectified linear unit (ReLU).
Then a temporal max pooling layer takes the global maximum of
each of the 128 channels. The last part is a fully connected neural
network (with optional ReLU activation) having 128 input nodes
and nine softmax output nodes corresponding to the different
classes of malware. The classes are presented in the
experiments section.
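
To make the data flow of the gated convolution concrete, the following pure-Python sketch runs a toy forward pass: embedding lookup, two parallel 1-D convolutions, sigmoid gating by elementwise multiplication, and global max pooling per filter. It is not the Keras implementation used in the experiments. The tiny sizes, the random weights, and the omission of biases, the optional ReLU and the final fully connected softmax layer are all assumptions made to keep the example short.

```python
import math
import random


def gated_conv_global_max(byte_seq, embed, w_lin, w_gate, width, stride):
    """Embedding -> two parallel 1-D convolutions -> sigmoid gate -> global max pooling."""
    emb = [embed[b] for b in byte_seq]                 # one embedding vector per byte
    n_filters = len(w_lin)
    pooled = [-math.inf] * n_filters                   # running maximum per filter
    for start in range(0, len(emb) - width + 1, stride):
        # Flatten the window so that each filter is a single dot product.
        window = [x for vec in emb[start:start + width] for x in vec]
        for f in range(n_filters):
            lin = sum(w * x for w, x in zip(w_lin[f], window))
            gate = sum(w * x for w, x in zip(w_gate[f], window))
            gated = lin * (1.0 / (1.0 + math.exp(-gate)))   # linear output times sigmoid gate
            pooled[f] = max(pooled[f], gated)
    return pooled


# Toy sizes (assumptions); the text above uses width = stride = 500 and 128 filters.
random.seed(0)
EMBED_DIM, WIDTH, STRIDE, N_FILTERS = 8, 4, 4, 3
embed = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(256)]


def random_filters():
    return [[random.uniform(-0.1, 0.1) for _ in range(WIDTH * EMBED_DIM)]
            for _ in range(N_FILTERS)]


w_lin, w_gate = random_filters(), random_filters()
features = gated_conv_global_max(list(b"MZ\x90\x00" * 5), embed, w_lin, w_gate, WIDTH, STRIDE)
```

The resulting `features` vector has one value per filter regardless of the input length, which is what allows a fixed-size fully connected layer to follow.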

An important factor in the MalConv design is regularization.
The authors of MalConv found that penalizing the correlation
between hidden state activations at the fully connected layer, as
described in [4], is the most effective form of regularization.
Quoting from [3]: "Intriguingly, the commonly used batch-normalization
actually prevented the models from converging and generalizing."
In our experiments, we have obtained very similar results that will
be discussed in the next section.


4 EXPERIMENTS 


In this section, we present the results of the experiments on the 
multi-class dataset [20]. We measure the performance in terms of 
accuracy and loss. The accuracy is simply the ratio of the number of
correct classifications to the number of all decisions made. The loss
is defined as the cross-entropy between the correct labels
and the probabilities output by the network. The loss is
seen as a stronger indication of performance than mere accuracy.
We used Keras [1] for the implementation of the tested networks.
The main hardware component is a GeForce GTX 750 GPU with
2 GB of memory and compute capability 5.0. We start by presenting
the dataset in the next subsection.
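
The two metrics defined above can be written down in a few lines. The sketch below is a plain-Python illustration rather than the Keras code used in the experiments; the clipping constant `eps` and the toy labels and probabilities are assumptions.

```python
import math


def accuracy(y_true, probs):
    """Fraction of samples whose most probable class equals the true label."""
    correct = 0
    for label, p in zip(y_true, probs):
        predicted = max(range(len(p)), key=p.__getitem__)
        correct += int(predicted == label)
    return correct / len(y_true)


def cross_entropy(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = sum(math.log(max(p[label], eps)) for label, p in zip(y_true, probs))
    return -total / len(y_true)


# Toy example with three samples and three classes.
y_true = [0, 2, 1]
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.4, 0.35, 0.25]]
acc = accuracy(y_true, probs)
loss = cross_entropy(y_true, probs)
```

Unlike accuracy, the cross-entropy also reflects how confident the network is in each decision, which is one way to read the remark that the loss is the stronger indicator.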


4.1 Dataset 


In this work, we used the data set provided by Microsoft and hosted 
at Kaggle [20]. The data set includes 10868 labeled samples and 
10873 unlabeled ones. For each sample, the raw data and the meta- 
data are provided. The raw data contains the hexadecimal repre- 
sentation of the malware executable byte code, with the Portable 
Executable (PE) header removed to ensure sterility. The meta-data 
file is a manifest generated using the IDA disassembler tool. It rep- 
resents the disassembled file (in X86 assembly language) containing 
various meta-data information such as function calls, strings, etc. 
The dataset contains malware samples belonging to the follow- 
ing nine families: Ramnit, Lollipop, Kelihos Ver. 3, Vundo, Simda, 
Tracur, Kelihos Ver. 1, Obfuscator.ACY and Gatak. One challenge 
of this data set is the unbalanced sizes of different families. The 
distribution of instances for the training dataset is shown in Table 1. 
Classes 4, 5, 6 and 7 are underrepresented as compared to the other 
families. 

MalConv models a malware sample as one long sequence of
bytes. We also experiment with another model, which is one long
sequence of opcodes. The sequence of opcodes is on average 60
times shorter than the sequence of bytes. Therefore, using opcodes
requires fewer GPU resources. Still, a rather costly preprocessing
phase of disassembling is required. Disassembling is sometimes
erroneous when malware authors use protection techniques
against some known disassemblers. For instance, in our dataset, 4
Kelihos_ver3, 22 Vundo, 6 Kelihos_ver1, 2 Lollipop, and 9
Obfuscator.ACY samples do not have any opcodes.

Using sequences of raw bytes is more attractive since it has all 
the information. However, it can contain a lot of noise which is 
deliberately injected by the malicious developers. 

In all cases, the trade-off between using sequences of bytes or 
sequences of opcodes seems interesting. Our intuition for opcodes 
is that the obfuscation techniques used by the malware authors in- 
troduce most of the time junk assembly instructions and a different 
context for each instruction. If malware families use different obfus- 
cation techniques leading to different opcode sequencing, then this 
can be encoded by embedding layers and caught by convolutional 
layers as features. We test this hypothesis by embedding the op- 
codes of each malware family in our data set using Word2Vec [15]. 
The embedding is presumably sensitive to the absence and presence 
of some opcodes, different ordering and different surroundings also 
referred to as a context. 

Figure 2 illustrates how malware classes have different embed- 
ding representations. Each subfigure represents the embedding of 
the union set of all opcodes for the samples of a given malware class. 
The embedding is originally learned by a Word2Vec model with a
context window size of 5 and eight output dimensions. The eight
dimensions are then reduced to two using t-SNE dimensionality
reduction with a perplexity of 40 and PCA initialization.

Some special opcodes are marked with crosses in the scatter plots. What
is interesting in this kind of plot is not the absolute position of an
opcode in the space but rather the relative neighboring information.
We notice, for example, that mov and jmp are very close in class 1 
whereas they are rather separated in class 2. 


4.2 Bytes vs. Opcodes 
We split the labeled data into three sets: 


- training set: 6526 samples (60%)
- validation set: 2175 samples (20%)
- testing set: 2175 samples (20%)


We show in Figure 3 the per-class validation and testing accuracy 
for the byte sequences and for the opcode sequences. 

The two models show good and similar performance. The exception
is class 5 (Simda), which has the worst accuracy. This is due to having
very few samples available for training. We also tested on the
unlabeled data set and obtained an overall log loss of 0.093 for
opcodes and 0.122 for raw bytes, according to the private scores
returned by Kaggle.


4.3 Regularization 


We start by showing that using batch normalization [10] on top of 
the convolutional filters is mostly not a good regularizer for our 
dataset. 

Figure 4 evaluates the validation accuracy for raw bytes and 
opcodes in terms of training epochs for small batch sizes (only four 
sequences for both bytes and opcodes). It shows that using batch
normalization gives worse results in both cases. Still, training for
more epochs seems beneficial in the case of opcodes.

Figure 5 evaluates the validation accuracy for raw bytes and op- 
codes in terms of training epochs for large batch sizes (16 sequences 
for bytes and 128 sequences for opcodes). Similarly, batch
normalization does not help in either case: it makes little
difference in the case of bytes and is much worse in the case
of opcodes in this experiment.

So why is batch normalization, which is known to be a very 
successful technique, spectacularly failing in our case? It turns 
out that batch normalization works better when the output of the 
convolution filters has a Gaussian distribution. This does not seem
to be the case for our data, which shows multi-modal distributions.
As an example, we plot the probability density function of an early 
activation node in the network along with the PDF of a fitting 
Gaussian distribution in Figure 6. 

MalConv authors propose to use DeCov [4] as a better regularizer. 
DeCov explicitly penalizes the correlation between the activations 
in the fully connected layers, hence preventing these and the previ- 
ous convolutional layers from overfitting and redundantly encoding 
the same information. It does so by directly adding the correlation 
terms to the loss function. Therefore, DeCov works in a very similar 
way to dropout regularization [24] except that the way dropout 
works is much more implicit. 
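
As an illustration of the penalty described above, the DeCov term of [4] can be computed as 0.5 * (||C||_F^2 - ||diag(C)||_2^2), where C is the covariance matrix of the hidden activations over a batch; this equals half the sum of the squared off-diagonal covariances. The sketch below is a plain-Python version for a small batch. It is meant only to show the quantity that is added to the loss, not the implementation used in our experiments, and the toy activations are assumptions.

```python
def decov_loss(activations):
    """DeCov penalty for a batch of activations (list of rows, one row per sample)."""
    n = len(activations)
    d = len(activations[0])
    means = [sum(row[j] for row in activations) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in activations]
    penalty = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:                       # diagonal (variance) terms are excluded
                c_ij = sum(r[i] * r[j] for r in centered) / n
                penalty += c_ij ** 2
    return 0.5 * penalty


# Two hidden units that encode the same information (second = 2 * first).
redundant = decov_loss([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
# Two hidden units that are uncorrelated over the batch.
independent = decov_loss([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
```

The redundant pair is penalized while the uncorrelated pair is not, which matches the intent of discouraging hidden units from encoding the same information.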

Table 2 shows that DeCov has the same effect on our dataset,
both for byte and opcode sequencing models. DeCov has even better
performance than dropout. Dropout with a ratio of 50% has better
results on bytes than on opcodes. However, tuning the ratio of
dropout (e.g., by decreasing the drop rate to 25%) leads to better
results.


Table 2: Classification Accuracy and Cross Entropy (training/testing) for Bytes and Opcodes without DeCov, with DeCov and
with Dropout (50%)

                     MalConv without DeCov     MalConv with DeCov        MalConv with Dropout
Test/Validation Set  Accuracy  Cross-Entropy   Accuracy  Cross-Entropy   Accuracy  Cross-Entropy
Opcodes              95/95%    0.30/0.26       96/96%    0.21/0.20       93/93%    0.44/0.44
Raw Bytes            95/95%    0.33/0.24       97/98%    0.22/0.13       95/95%    0.20/0.19


Figure 2: Embedding representation for each malware class

Figure 3: Average classification accuracy per malware class
for the raw bytes and opcodes data.

Figure 4: Comparison of the validation set accuracy of with-
vs.-without Batch Normalization for small batch sizes


4.4 GPU Memory limitations and sequence 
length vs. batch size trade-off 


Our experiments were constrained by the use of a single GeForce
GTX 750 GPU with 2 GB of memory. This limitation has impacted the length
of sequences that can be handled. We have tried to overcome
this problem by decreasing the batch size, changing the network
parameters at different layers and truncating the sequences. This
has led to three network configurations:


- Network1 (convolution stride = 500, filter width = 500,
  number of filters = 128),

- Network2 (convolution stride = 1000, filter width = 1000,
  number of filters = 64),

- Network3 (convolution stride = 100, filter width = 500,
  number of filters = 160).


Figure 5: Comparison of the validation set accuracy per
epoch of with-vs.-without Batch Normalization for large
batch sizes

Figure 6: The probability density functions of a convolu-
tional filter output in case of opcodes vs. bytes vs. Gaussian


Network1 is very similar to the original MalConv architecture, Net- 
work2 is less memory demanding and Network3 is more memory 
demanding. In Table 3, we list the maximum possible sequence 
length for each network type and network parameters. We obtained 
these values using an iterative binary search. Generally, increasing
the batch size and the number of filters improves the classification
accuracy even when sequences are moderately truncated.
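
The search for the maximum sequence length can be sketched as a binary search over a feasibility test. In the sketch below, `fits_in_memory` is a hypothetical stand-in for "one training step at this length runs without exhausting GPU memory"; its linear memory model and the budget value are assumptions and do not reproduce the figures in Table 3.

```python
def max_feasible_length(fits, low, high):
    """Largest length in [low, high] for which fits(length) is True.

    Assumes monotonicity: if a length fits, every shorter length fits as well.
    Returns None when even `low` does not fit.
    """
    if not fits(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2      # round up so the interval always shrinks
        if fits(mid):
            low = mid
        else:
            high = mid - 1
    return low


# Hypothetical feasibility test with a toy, linear memory model.
def fits_in_memory(length, batch_size=64, budget=1_000_000):
    return length * batch_size <= budget


best = max_feasible_length(fits_in_memory, 1, 10_000_000)
```

In practice the feasibility test would try to run the network on one batch and report whether an out-of-memory error occurred, so each probe is expensive and the logarithmic number of probes matters.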


Table 3: Maximum Possible Sequence Length under Different Batch sizes and Network parameters

              Opcodes (8 dimensions)        Bytes (4 dimensions)
Network type  Batch size  Maximum length    Batch size  Maximum length
Network1      64          150,247           8           2,342,505
              128         64,391            16          1,171,252
              256         30,049
Network2      64          163,125           8           2,459,630
              128         85,855            16          1,210,294
              256         34,342
Network3      64          128,783           8           1,756,879
              128         62,245            16          976,044
              256         27,903


5 CONCLUSION


In this paper, we have assessed the transfer learning of the MalConv
neural network architecture for a multi-class malware classification
problem, which is a shift from the original goal of the MalConv
design. We found that MalConv has very similar performance
on the studied dataset and that the lessons learned from
MalConv remain valid for the multi-class malware classification
problem. Another deviation that we have experimented with is
considering the malware as a sequence of opcodes rather than a
sequence of raw bytes. Although this approach requires preprocessing
overhead, it makes the length of the sequences much more manageable
given limited GPU memory resources. In particular, we confirm
that regularizers such as dropout and DeCov are much
more likely to improve the network convergence, whereas batch
normalization can have negative effects. To our knowledge, this is
the first attempt to apply transfer learning from malware/benign
classification to malware multi-classification.

The size of the target dataset is rather small compared to the 
datasets studied in the MalConv paper, which is a strong motivation 
for transfer learning. In future work, we aim to experiment with 
more datasets derived from different sources to validate the viability 
of transfer learning in malware data on a larger scale. We would also
like to experiment with transfer learning using the original pre-trained
model of MalConv from NVIDIA, if publicly available.
ABSTRACT


The market for invoice financing has been steadily growing in the
last few years and was the third-largest financing market in
2016. Most solutions in this field are based on private platforms, and
even the new proposals based on blockchain mostly adopt
a private, permissioned blockchain. In this paper, we propose an
idea based on a public blockchain that allows both fully open and
group-restricted auctioning of invoices. Furthermore, our proposal
introduces a reputation system that is based on the past behavior
of entities, as recorded by the public blockchain, to allow
insurance companies to adjust the cost of the insurance contracts
they offer. This combination guarantees the complete transparency
and tamper resistance of a public blockchain, while reducing
insurance costs and fraud possibilities.


CCS CONCEPTS 


• Security and privacy → Database and storage security;
Database activity monitoring.
1 INTRODUCTION


Companies work hard to ensure economic liquidity and maintain
a steady cash flow. However, both are seriously
affected by long invoice due dates, which represent a big
challenge, especially for small to medium enterprises (SMEs). To
overcome this issue, companies make use of different forms
of invoice financing such as factoring. This type of financing enables
businesses to cash in invoices before their due date. The process
of factoring can be described as follows: an SME sells the invoice
to a factoring company, which is often a financial institution, for a
pre-agreed percentage of the invoice amount; the buyer then pays
the factoring company the full invoice amount on the due date.
While this helps the SME solve the cash-flow issues, it exposes the 
factoring companies to serious fraud risks mainly because of the 
lack of communication among themselves. In fact, a well known 
fraud risk in factoring is double financing, where the SME sells the 
same invoice to more than one financial institution. The buyer will 
naturally pay the invoice once, paying only one institution and 
leaving the rest unpaid. Another considerable risk is represented 
by a situation where the buyer refuses to pay as agreed on the due 
date of the invoice. One of the main reasons that leads to this is the 
fact that a financial institution does not have a direct relationship 
with the buyer and relies only on the information provided by the 
seller, in our example the SME. 

One potential solution to the double financing problem is an 
invoice financing platform hosted on a centralized database where 
all the potential invoice-buyers can verify whether the invoice 
has been already funded or is still available. However, centralized 
systems can be expensive, they are a single point of failure, and they 
are prone to privacy infringement, data manipulation and attacks 
which may make them unreliable and untrustworthy. Luckily, with 
the emergence of blockchain technology and smart contracts, we 
no longer have to rely on centralized systems. Blockchain may be 
used to implement an immutable, trusted, and decentralized ledger 
[6] that relies on a consensus algorithm to decide which data is 
appended [13]. 

In this paper, we propose an invoice financing solution through 
auctioning based on InterPlanetary File System (IPFS) [2] and 
Ethereum blockchain [16]. The invoice data is stored on the IPFS 
while its corresponding IPFS hash is stored into a blockchain smart 
contract in order to ensure integrity, traceability and authenticity 
of the invoice. Moreover, the proposed solution uses a reputation
system which helps reduce fraud risks. The rest of this
paper is structured as follows: in Section 2 we introduce the invoice
financing solution; in Section 3 we describe the fraud scenarios
and countermeasures; in Section 4 we present related work. Finally,
Section 5 concludes this paper.


2 THE PROPOSED INVOICE FINANCING 
SOLUTION 


2.1 System overview 


In this paper, we propose a prototype of an invoice financing
platform for SMEs based on the InterPlanetary File System (IPFS),
reputation profiles, and smart contracts hosted on the Ethereum blockchain.
Every function call that modifies the blockchain state or smart
contract executed on the Ethereum blockchain requires Gas [1]. Gas
is a unit that is used to calculate the amount of fees that need to
be paid to the network in order to execute an operation. Since the
invoice data are very sensitive and storing them directly in the
blockchain is very expensive, we do not plan to store the whole
invoice inside the blockchain. Instead, we propose to use
IPFS to store these data in a decentralized, distributed manner that
is publicly and globally accessible through the use of IPFS hashes.
At the same time, to control access to the data, we encrypt the IPFS
hash with the authorized investors' public keys and store only these
encrypted hashes in a smart contract. Thus, any modification of the
invoice content would change the IPFS hash, which would then no
longer match the hash stored within the smart contract. The
confidentiality of invoice data is ensured because only the authorized
investors will be able to access it using their private keys.
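
The integrity argument above relies on content addressing: the identifier is derived from the content, so any change to the invoice yields a different identifier. The sketch below illustrates this with a plain SHA-256 digest as a stand-in. Real IPFS identifiers are multihash-based content identifiers rather than raw hex digests, and the encryption of the hash with the investors' public keys is omitted; the invoice fields are hypothetical.

```python
import hashlib


def content_hash(invoice_bytes: bytes) -> str:
    """Stand-in for an IPFS hash: hex SHA-256 digest of the invoice content."""
    return hashlib.sha256(invoice_bytes).hexdigest()


def is_unmodified(invoice_bytes: bytes, hash_from_contract: str) -> bool:
    """Check retrieved invoice data against the hash recorded in the smart contract."""
    return content_hash(invoice_bytes) == hash_from_contract


# Hypothetical invoice; the field names are placeholders.
invoice = b'{"invoice_id": "INV-001", "amount": 1200}'
stored = content_hash(invoice)                      # value the seller would record
tampered = invoice.replace(b"1200", b"9200")
```

An investor who retrieves `tampered` instead of `invoice` would see the check fail, because the recomputed digest no longer matches the recorded one.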
The main components of our platform are: 


- a smart contract hosted on the Ethereum blockchain,
- the Ethereum client,
- IPFS,
- a web app.

The web app provides a graphical user interface for the Ethereum 
client, which in turn interacts with the smart contract on the 
Ethereum blockchain. The roles of the participants can be sum- 
marized as follows: 

Seller: is a company that has the goods to be packaged and 
transferred to the buyer and it is looking to improve its cash flow 
by creating a smart contract capable of selling the invoice to one 
of the investors enrolled in the platform through an auction. This 
kind of company is usually an SME. 

Buyer: is a company that would like to purchase the goods from 
the seller by paying the shipping amount on delivery and benefits 
from the delayed payment of the full invoice amount (i.e., the price 
of goods plus taxes). 

Authorized investor: is a person or a financial institution that 
is allowed to participate in the auction to buy the invoice at a price 
lower than its real value to gain a profit. 

Insurance: is responsible for reimbursing the authorized investor
in case the buyer refuses to pay.

Unlike the traditional financing model, our platform does not 
limit the factoring service to banks and financial companies. Any 
investor can subscribe to the web app and make an offer to partic- 
ipate in the auction of an invoice. The highest offer made by an 
authorized investor that satisfies the minimum requested amount 
wins the auction once the bidding time has expired. This enables
SMEs to invite a large number of investors around the world
and get the best financing offer in a short time and with less
effort.
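
The rule for selecting the winning offer can be sketched as follows. This is an off-chain Python illustration, not the smart contract code. The function name, the tie-breaking rule (the earliest bid wins) and the strict comparison with the minimum requested amount are assumptions.

```python
def auction_winner(bids, minimum_bid, authorized):
    """Return the winning (investor, amount) pair, or None if no bid qualifies.

    `bids` is a list of (investor, amount) pairs in order of arrival.
    """
    valid = [(investor, amount) for investor, amount in bids
             if investor in authorized and amount > minimum_bid]
    if not valid:
        return None
    return max(valid, key=lambda bid: bid[1])   # max() keeps the earliest of equal bids


# Hypothetical bids; "inv3" is not authorized for this auction.
bids = [("inv1", 900), ("inv2", 950), ("inv3", 990)]
winner = auction_winner(bids, minimum_bid=800, authorized={"inv1", "inv2"})
```

Restricting `authorized` to a predefined group corresponds to the group-restricted auction, while passing the set of all registered investors corresponds to the fully open one.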

At the same time, the buyer will benefit from the delayed invoice 
payment to optimize the use of their working capital. 


2.2 Challenges 


Since the investors do not have any direct knowledge of either the 
seller or the buyer, they are exposed to a considerable amount of 
risk. As an example, there is the risk of the invoice not being paid as 
agreed by the buyer; another significant risk is the seller knowingly 
submitting false, modified or duplicate invoices with the intent to 
commit a fraud, either acting alone or in collusion with the buyer. 
A solution might be to add risk insurance to refund the investor;
however, in the absence of significant countermeasures aiming
at reducing the fraud opportunity, the cost of such insurance
would make the whole operation economically unfeasible. Hence, the
simple addition of insurance is not considered a viable solution.


2.3 System design 


The proposed platform mitigates these risks by adding a transporter
entity and a reputation profile. The former provides information
about shipping status, while the latter shows the list of invoices
that have been paid or left unpaid by the buyer on the due date, without
showing the confidential data. This can help investors in the
selection of trustworthy counterparts while pushing malicious buyers
off the system.

The platform allows the seller and their counterparts to register
by selecting the account type (e.g., seller account, investor account,
etc.) and providing a unique identity certificate, which ensures
that they cannot create another account with a clean reputation
profile in case of fraud. The services are provided according to
the type of the account and every time the contract data changes, a 
notification is sent to the counterpart. 

As shown in Figure 1, the seller writes the invoice data into IPFS 
and creates a smart contract that specifies the minimum amount 
required to participate in the auction and the hash to retrieve the 
invoice from IPFS. Then, he deploys it into the Ethereum blockchain. 
If the invoice is genuine, the buyer accepts the invoice and pays 
the shipping amount. By accepting the invoice, the buyer states
that he has verified all the information mentioned in the invoice and
that he agrees to pay the shipping amount immediately and the entire
amount on the due date as specified in the invoice. Afterward, the
investors can participate in the auction and thus read the invoice 
data and make an offer after checking the following conditions: 


- the invoice has been accepted by the buyer;
- the "invoice ID" has not been submitted before;
- the buyer has confirmed that the delivery is in order;
- the reputation profiles of both the seller and the buyer show
  that they are trustworthy.


If the reputation profile shows that one of them is untrustworthy
or the invoice does not meet one of the above-mentioned
requirements, then it will not be funded by the investors. An investor
who decides to finance an invoice in spite of the above-mentioned
problems is fully responsible for his decision and knows that, in
case of fraud, his refund request will be rejected by the
insurance. Besides protecting against double financing and the submission
of false or modified invoices, our platform mitigates the risk of a buyer
who does not pay as agreed. In fact, in our platform the reputation
profile will show that a buyer is untrustworthy, and investors may
freely take a fully informed decision on whether to run the risk.
Thus, our platform facilitates invoice financing for SMEs and
reduces the risk of fraud.
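
The participation conditions and their effect on refund eligibility can be summarized in a short sketch. The data structure and function names are hypothetical, and the two trustworthiness flags are taken as given inputs, since how a reputation profile is turned into a yes/no judgment is left to the investor.

```python
from dataclasses import dataclass


@dataclass
class InvoiceStatus:
    accepted_by_buyer: bool
    invoice_id_already_submitted: bool
    delivery_confirmed: bool
    seller_trustworthy: bool
    buyer_trustworthy: bool


def unmet_conditions(status: InvoiceStatus) -> list:
    """List the participation conditions that the invoice fails to meet."""
    checks = [
        (status.accepted_by_buyer, "invoice not accepted by the buyer"),
        (not status.invoice_id_already_submitted, "invoice ID submitted before"),
        (status.delivery_confirmed, "delivery not confirmed by the buyer"),
        (status.seller_trustworthy and status.buyer_trustworthy,
         "seller or buyer reputation is untrustworthy"),
    ]
    return [reason for ok, reason in checks if not ok]


def refund_eligible(status: InvoiceStatus) -> bool:
    """An investor who funded an invoice despite unmet conditions is not refunded."""
    return not unmet_conditions(status)


genuine = InvoiceStatus(True, False, True, True, True)
duplicate = InvoiceStatus(True, True, True, True, True)
```

Here `genuine` passes all checks, whereas `duplicate` is flagged because its invoice ID has been seen before, which is the double financing case.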


Figure 1: Invoice financing solution based on blockchain and IPFS.


2.4 The proposed invoice financing workflow


Figure 2 illustrates the message sequence diagram of selling the
invoice through an auction with two possible scenarios. In the
first scenario, the buyer pays on the due date of the invoice, while in
the second, the buyer refuses to pay. The interactions between the
different entities and the smart contract are as follows:


(1) The seller creates a smart contract and deploys it in the 
Ethereum blockchain. The seller can choose to open the 
auction to all the investors in the platform or only to some 
predefined investors. In the case of two authorized investors, the
main contents of the smart contract are: hash (Invoice ID),
shipping amount, the minimum bid requested, the highest
bid, offers, auction deadline, shipment status and the IPFS hash
encrypted with the public keys of investor 1, investor 2 and the buyer.

(2) The buyer decrypts the IPFS hash using his private key and 
verifies the invoice data. If the invoice is genuine the buyer 
accepts the invoice and performs a safe payment of the ship- 
ping price. The smart contract holds this amount of Ether 
until the delivery. 

(3) The transporter verifies that the invoice has been accepted by
the buyer, then updates the shipment status on the smart
contract to "in transit" upon receiving the goods.

(4) The buyer verifies that the shipment status on the smart
contract is "in transit", then updates it to "delivered" once the
goods are received. The smart contract pays out the
transporter for the shipment.

(5) The investors verify the participation conditions mentioned 
above in order to decide whether to bid on this invoice or 
not. 

(6) In case all the conditions are met, the first investor places his 
bid which should be higher than the minimum bid requested 
by the seller. 

(7) The second investor places his bid, which should be higher
than the highest bid (i.e., bid 1). The highest bidder becomes
the owner of the invoice when the auction ends.

(8) The seller asks for an early payment when the auction has ended.
The smart contract transfers the highest bid to the seller.

(9) When the auction ends, investor 1 asks to withdraw his
funds because he did not win the auction. The smart contract
sends investor 1 his corresponding bid amount.

(10) In scenario A, the buyer pays the entire amount on the due
date of the invoice to investor 2 through the smart contract.
An event BuyerReputation(BuyerAddress, "invoice paid
on due date") will be triggered to help in tracing the buyer
reputation and in notifying all parties.

(11) In scenario B, the buyer did not pay on the due date of the
invoice as agreed, and thus investor 2 sends a refund request.
Two events will be triggered: RefundRequest(msg.sender,
"Refund request") to notify the insurance, and
BuyerReputation(BuyerAddress, "Unpaid invoice on due date") to
create a notification and save a log about the buyer reputation.
In this scenario, the buyer profile will show that this buyer
is untrustworthy.

(12) The insurance verifies that investor 2 has not asked for a refund
before and that he made the necessary verifications before
participating in the auction.

(13) The insurance refunds investor 2 through the smart
contract.


Figure 2: Sequence diagram of the proposed invoice financing workflow.

It is important to mention that in step 8, the seller manually
invokes the smart contract when the auction ends to receive his
money because the contract cannot activate itself. Automating the
reimbursement of the investors that did not win the auction would be
possible by performing step 9 as part of step 8. Nevertheless, we
kept step 9 separate so that investors withdraw their funds themselves
rather than having funds pushed to them automatically, for the
following security reasons: i) sending ether back to all the investors
that did not win the auction could run out of gas; ii) sending ether to
unknown addresses could lead to security vulnerabilities [5].
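
The pull-based reimbursement of steps 6 to 9 can be illustrated with a
minimal sketch. It is an in-memory Python model written for this
explanation, not the Solidity contract of the proposed solution; the
class name Auction, its methods and the investor names are hypothetical.

```python
class Auction:
    """Toy in-memory model of bidding and withdrawal (an assumption, not the paper's contract)."""

    def __init__(self, minimum_bid):
        self.minimum_bid = minimum_bid
        self.highest_bidder = None
        self.highest_bid = 0
        self.pending_returns = {}  # bidder -> amount the bidder may withdraw
        self.ended = False

    def bid(self, bidder, amount):
        if self.ended:
            raise ValueError("auction already ended")
        # A bid must exceed both the seller's minimum and the current highest bid.
        if amount <= max(self.minimum_bid, self.highest_bid):
            raise ValueError("bid too low")
        if self.highest_bidder is not None:
            # The outbid investor is not paid here; the amount is only recorded.
            previous = self.pending_returns.get(self.highest_bidder, 0)
            self.pending_returns[self.highest_bidder] = previous + self.highest_bid
        self.highest_bidder = bidder
        self.highest_bid = amount

    def end(self):
        """Step 8: the seller ends the auction and receives the highest bid."""
        self.ended = True
        return self.highest_bid

    def withdraw(self, bidder):
        """Step 9: each losing investor pulls his own funds, one call per investor."""
        return self.pending_returns.pop(bidder, 0)


auction = Auction(minimum_bid=100)
auction.bid("investor1", 120)
auction.bid("investor2", 150)
paid_to_seller = auction.end()
refund = auction.withdraw("investor1")
```

Because every withdrawal is a separate call, no single call has to loop
over all losing bidders, which is the concern behind reason i) above.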


3 FRAUD SCENARIOS AND COUNTERMEASURES


In this section, we present the possible fraud scenarios and we
explain how the proposed solution, without relying on trusted third
parties and just leveraging smart contracts and public blockchain
technology, reduces the possibility of fraud in invoice financing
between mutually untrusted entities.

All involved entities will be able to share and monitor the
information related to invoice, auction, shipping and payment in a
transparent manner. In addition, this information is immutable
and cannot be changed. Therefore, the information that is used to
build the reputation profile is reliable.

Scenario 1: The seller knowingly submits a false or modified invoice.
Our solution prevents this fraud because the invoice will not be
funded by the investor unless it has already been accepted by the
buyer. The buyer has an incentive to accept the invoice only
if it is genuine because his reputation is at stake and he could lose
the shipping amount.

Scenario 2: The buyer colludes with the seller and accepts the false
invoice submitted by the seller in order to commit a fraud and split
with the seller the amount of Ether received from the investor. In this
case, the buyer will be identified as untrustworthy. Furthermore,
the buyer's acceptance alone is not enough to get funding, because the
investor also verifies that the transporter has received the goods
before deciding to finance the invoice.

Scenario 3: The seller submits a duplicate invoice in order to have 
double financing. Our platform enables both the buyer and the 
investor to verify that the invoice has not been submitted before 
because of the unique "Invoice ID" and the transparency guaranteed 
by the public blockchain. 
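
The duplicate check of Scenario 3 can be sketched as follows. This is
an illustrative Python model under stated assumptions: the class
InvoiceRegistry, its methods and the use of a SHA-256 digest of the
invoice content are hypothetical and only stand in for the on-chain
lookup by "Invoice ID".

```python
import hashlib


class InvoiceRegistry:
    """Toy stand-in for the public ledger: one entry per invoice ID."""

    def __init__(self):
        self._entries = {}  # invoice_id -> (seller, content digest)

    def register(self, invoice_id, seller, content):
        # A second submission of the same ID is rejected, so the same
        # invoice cannot be financed twice.
        if invoice_id in self._entries:
            raise ValueError(f"invoice {invoice_id!r} already submitted")
        digest = hashlib.sha256(content).hexdigest()
        self._entries[invoice_id] = (seller, digest)
        return digest

    def is_registered(self, invoice_id):
        # Any party (buyer or investor) can run this check before acting.
        return invoice_id in self._entries


registry = InvoiceRegistry()
registry.register("INV-001", "seller-A", b"invoice body")
already_there = registry.is_registered("INV-001")
```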

Scenario 4: The buyer refuses to pay the investor in due time as
stated on the invoice because he did not receive the goods. Our
platform enables the investor to check whether the goods have been
delivered, with a confirmation from the buyer, before participating in
the auction. The transporter has an incentive to have the delivery
confirmed by the buyer because his payment depends on the shipment
status. Otherwise, the transporter will not agree to deliver
the goods.

Scenario 5: The buyer receives the goods but refuses to pay on the due
date of the invoice. In this case, the investor will be refunded by the
insurance and this buyer will be easily identified as malicious and
untrustworthy through his reputation profile.


4 RELATED WORK 


Most researchers who propose blockchain-based solutions for
invoice financing focus on the issue of double financing.

Nijeholt et al. [9] proposed DecReg, a framework based on
blockchain technology to address the "double-financing" issue in
factoring. The framework has been implemented on a private
blockchain. Access to the blockchain is controlled by a central
authority (CA). The authors pointed out that the only feasible attack
would be a collusion between the seller and the CA, in which the CA
prevents the financial institution from accessing the network, which
makes the institution vulnerable to double-financing. Hence, the
financial institution should halt invoice financing until it regains
access to the blockchain network.

Hofmann et al. [7] stated that registering invoices on the
blockchain provides the opportunity to prevent fraud and
double-financing issues in invoice discounting and factoring. Each
invoice distributed across the network is hashed, timestamped, and
given a unique identifier to prevent multiple financing of that
particular invoice. However, the authors did not provide implementation
details such as whether the invoice is registered in a public or
private blockchain and how the different parties interact with each
other.

Similarly, Nicoletti et al. [14] stated that blockchain can play
an important role in preventing fraud during the implementation of
procurement finance solutions, notably reverse factoring. Blockchain
provides complete traceability and real-time visibility of invoice
status, which prevents fraudulent organizations from extracting
funds from multiple financial institutions by using the same invoice.

In [15], the authors proposed a conceptual framework based on
blockchain technology for reverse factoring and dynamic discounting.
Efficiency, transparency, and autonomy were identified as
blockchain value drivers that will improve supply chain finance
solutions.

Bogucharskov et al. [3] presented a possible interaction between
supplier, customer and factor in a blockchain-based factoring
application. In their interaction model, the factor provides funding to
the supplier upon confirmation by the customer that he received
the goods. However, the authors did not take into consideration the
fraud risks that arise if the supplier or the customer is untrustworthy
or malicious. In addition, storing invoices in the public blockchain is
very expensive in terms of both storage and computation.

Kayal et al. [8] stated that blockchain technology can be a pow- 
erful tool to tackle the financing problems of SMEs. In addition, 
they conducted an exploratory research into the appetite of the 
stakeholders involved in invoice factoring and inventory finance 
for adopting the blockchain technology. 


5 CONCLUSION 


In this paper we have put forward an idea for the invoice factoring 
and financing problem that is based on the IPFS, the Ethereum 
blockchain, smart contracts and reputation profiles. Our proposal 
is expected to provide a higher level of transparency than most 
solutions previously proposed, as it uses a public blockchain instead 
of a private one. Besides, the use of a proof-of-work based public 
blockchain also guarantees a better resilience to tampering and 
collusion. Finally, as we showed in this paper, our solution is capable 
of preventing most practical cases of fraud and, by providing better
guarantees, it allows lowering the cost of the insurance that is needed
to protect the involved parties from residual fraud cases.

As a future extension, it is worth pointing out that, in principle,
the adoption of a public blockchain based on proof-of-work may
lead to energy waste, as each fraud attempt carried out by any
of the involved parties is expected to lead to some form of energy
loss. For this reason, we argue that the energy impact of adopting
a public blockchain in actual invoice financing scenarios should
be investigated in future work, together with the energy-wasting
attacks that malicious parties may deliberately attempt. We plan to
model the energy consumption of a public blockchain by leveraging
models previously adopted in other contexts (e.g., [4, 10-12]).
ABSTRACT


The number of sources and the sheer volume of spatiotemporal data
have grown at an unprecedented rate during the last decade. As a
consequence, a rapidly increasing number of applications seek
to generate value by crunching those data. The development of
a system that taps into the potential value of spatiotemporal
big data analysis for a multitude of applications remains one of the
biggest challenges in computer engineering. This paper delves into
the key characteristics of the most prominent such systems. In
particular, it provides a thorough analysis of NoSQL datastores as
well as a traditional relational database system in terms of their
geospatial querying capabilities.
1 INTRODUCTION


Nowadays, massive volumes of spatiotemporal data are constantly
being generated by many scientific, engineering and business
applications. For example, NASA’s Earth Observing System produces
1 TB of remotely sensed data each day [1], while the Automatic
Dependent Surveillance Broadcast (ADS-B) system, which
gathers information about the position, identification and course of
aircraft, produces 285 billion points per year [2]. Also, the
Department of Computing at the Federal University of Ceara (UFC) tracks
vehicle movements in the area of Fortaleza, generating huge volumes of
data (Figure 1).

Spatiotemporal database management systems (STDBMSs) con- 
stitute core components in tackling the challenges of spatio-temporal 
applications. They allow for the efficient management of the data 
and the application of complex queries. 




Figure 1: A small fraction of data points from vehicles mov- 
ing in Fortaleza area, gathered from UFC. 


As expected, the demand for high quality of service and in- 
creased performance introduces new non-functional requirements 
that these systems need to cope with. Such requirements include: 
scalability, availability, reliability, consistency, performance and
accuracy.

Towards this end, distributed computing is considered to be
an enabling technology. Popular DBMS implementations are
based on distributed computing architectures and the majority of
them provide for the management of spatio-temporal data
in their core functionality (e.g. Redis¹, HBase [3], MongoDB² and
PostgreSQL [4]).

The goal of this paper is to highlight these spatio-temporal
functionalities of DBMSs and present the key architectural
characteristics that support them. It goes on to compare
a relational database system and several widely used
NoSQL datastores across those characteristics. Based on the
literature review and to the best of our knowledge, the most prominent
differences concerning geospatial support among these data stores are
their data model and a number of geospatial capabilities. Such
capabilities relate to the geospatial indexing used, the geometry types
supported and the spatial query operators provided.

The rest of this paper is organized as follows. Section 2 presents a 
categorization of the spatio-temporal data types. Section 3 presents 
the key-characteristics of the database systems examined. Section 
4 highlights the geospatial capabilities while Section 5 presents the 
final conclusions. 


2 SPATIO-TEMPORAL DATA TYPES 


The spatio-temporal data types can be divided into two major cate- 
gories: spatial data and temporal data as shown in Figure 2. 

Within the spatially referenced data group, the data can be further
classified into two different types, raster and vector. Generally,
point, line, and polygon are the primitive data types of the vector
type. Point data have zero dimensions and are used to represent
non-adjacent features and discrete data points. Line data are used to
represent linear features. Line features have a starting and an ending
point, and the one dimension of which they are composed can be
used to measure length. Polygons are used to represent areas such
as the boundary of a city. Polygon features are two-dimensional
and therefore can be used to measure the area and perimeter of a
geographic feature. These abstract spatial types have several common
properties, such as coordinates within a reference system, and
operations, such as the calculation of distance or containment. On the
other hand, raster data types (grid data) are cell-based and represent
surfaces and aerial and satellite imagery.
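
The measurements associated with the vector primitives can be made
concrete with a short sketch. It assumes planar (projected) coordinates
and simple, non-self-intersecting polygons; the function names are
hypothetical and chosen only for this illustration.

```python
import math


def line_length(points):
    """Length of a line feature given as a list of (x, y) vertices."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def polygon_perimeter(ring):
    """Perimeter of a polygon given as an open ring of (x, y) vertices."""
    closed = ring + ring[:1]  # repeat the first vertex to close the ring
    return line_length(closed)


def polygon_area(ring):
    """Area of a simple polygon via the shoelace formula."""
    closed = ring + ring[:1]
    twice_area = sum(
        x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(closed, closed[1:])
    )
    return abs(twice_area) / 2


square = [(0, 0), (2, 0), (2, 2), (0, 2)]
length = line_length([(0, 0), (3, 4)])   # 5.0
area = polygon_area(square)              # 4.0
perimeter = polygon_perimeter(square)    # 8.0
```

On geographic (longitude, latitude) coordinates these planar formulas
are only approximations; a spherical or ellipsoidal model would be
needed for accurate results.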

Concerning temporal data, spatio-temporal databases support
three kinds of time. Transaction time is the time at which an
object was stored as a record in the database. Under this notion the
temporal aspects of an element evolve discretely, and this kind of time
is used to trace past states of objects. Valid time is the time during
which an object has existed in reality. Under this notion the temporal
aspect of an element changes continuously; this kind of time is applied
to facts and events and used on object attributes and relationships
between objects. Bitemporal time is a combination of transaction
and valid time and is used to trace the evolution of a dynamic
collection of valid-time facts. The valid time can be of the event or
period type while the transaction time can be of the interval type.


¹ Redis, https://redis.io/
² MongoDB, https://www.mongodb.com/
3 DATABASE SYSTEMS


This section presents the key characteristics of the database systems
compared. Redis, MongoDB, Neo4j and HBase belong to a
wider category called NoSQL, which describes a large class of DBMS.
These systems do not follow the rules of the traditional relational
DBMS and do not use traditional SQL queries over the
data. NoSQL-based systems are often open source projects and are
designed to process and handle very large datasets, which are
particularly prone to performance problems caused by the limitations of
SQL and the relational model of databases. These systems typically
run on clusters built from commodity hardware, provide
"shared nothing" horizontal scalability, can support a large number
of concurrent users and deliver highly responsive experiences to a
globally distributed user base. In addition, they provide a dynamic
schema and can handle semi-structured and unstructured data. Based
on the data model, the NoSQL data stores can be classified into four
major types: key-value stores, column-family stores, document stores,
and graph stores. We consider a representative and widely used
system from each type that includes spatial extensions and provides
a license-free installation.

One representative key-value store is Redis. Specifically, Redis is
an in-memory key-value store, used as a database, cache and message
broker. It supports various data structures including Strings,
Lists, Sets, Sorted Sets, Hashes, Bitmaps and HyperLogLogs, which makes
it extremely powerful and allows the execution of complex client
functionality. Because Redis is an in-memory system,
operations are executed extremely efficiently in memory, and
different functionalities can therefore be achieved with low
complexity, minimal network overhead and low latency.
Thus, it can handle extremely high throughput (millions of
operations per second) compared with partially disk-based
database solutions, which require a large cluster of nodes to handle
high-volume real-time updates.

The distributed implementation of Redis is called Redis Cluster. 
The nodes in Redis Cluster are responsible for holding the data, 
capturing the state of the cluster and mapping keys to the right 
nodes. The nodes in the cluster are connected through a service 
channel using a TCP bus and a binary protocol, called the Redis 
Cluster Bus [5]. To exchange and propagate information about the 
cluster, nodes use a gossip protocol in order to auto-discover new 
or existing nodes, to send ping-pong packets, to detect working or 
non-working nodes, to send cluster messages, to trigger specific 
conditions and to promote slave nodes to master when needed in 
order to continue to operate when a failure occurs [6]. 
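
The mapping of keys to nodes can be illustrated with a short sketch.
It relies on the hash-slot scheme described in the Redis Cluster
specification, which is not detailed in the text above: a key is
assigned to one of 16384 slots by taking CRC16 (XMODEM variant) of the
key modulo 16384. The node names and slot ranges below are hypothetical,
and hash tags ("{...}" inside a key) are deliberately not handled.

```python
import binascii

HASH_SLOTS = 16384


def key_hash_slot(key):
    """Slot of a key: CRC16-XMODEM of the key bytes, modulo 16384."""
    # binascii.crc_hqx with initial value 0 computes the CRC-CCITT/XMODEM checksum.
    return binascii.crc_hqx(key.encode("utf-8"), 0) % HASH_SLOTS


def node_for_slot(slot, slot_ranges):
    """Return the node whose inclusive slot range contains the slot."""
    for node, (first, last) in slot_ranges.items():
        if first <= slot <= last:
            return node
    raise KeyError(f"slot {slot} is not assigned to any node")


# Hypothetical three-master layout covering all slots.
slot_ranges = {
    "node-a": (0, 5460),
    "node-b": (5461, 10922),
    "node-c": (10923, 16383),
}

slot = key_hash_slot("123456789")
owner = node_for_slot(slot, slot_ranges)
```

Because the slot depends only on the key, any node or client can compute
it locally and redirect a request to the node that currently owns that
slot.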

Apache HBase is an open source distributed column-family based
datastore built on top of HDFS that provides high scalability and
fault-tolerance. Data are stored in labeled, versioned tables, which
are in turn stored as multidimensional sparse maps. Each version
is identified by a timestamp that is assigned automatically at cell
creation time, and a set of column families is defined at table
creation. Each table consists of rows and columns, where each row
contains a sorting key and an arbitrary number of columns. Every column
can have several versions for the same row key. Every cell is tagged by
a column name and family, while rows are sorted by the row key,
which serves as the primary key. Select queries are executed based on
the table's primary key and each scan results in a MapReduce job [3].
Figure 2: Spatio-temporal data types


An HBase deployment has a master node, which is responsible
for holding the cluster state, for assigning regions to regionservers
and for recovering regionservers in case of failure [7]. It is worth
noting that HBase is modeled after Google's BigTable. The data models
of these two systems are very similar. Tables in HBase are
automatically partitioned horizontally into regions. Every region
contains a subset of the table's rows. Similar to HDFS and MapReduce,
HBase follows a master/slave architecture. The three major components
of HBase are: the HBaseMaster, which is responsible for
assigning regions to HRegionServers; the HRegionServers, which are
responsible for handling client read and write requests; and the HBase
client, which is responsible for finding the HRegionServers that are
serving a particular row range [8].

MongoDB is an open source document based datastore which is
commercially supported by 10gen. Although MongoDB is non-relational,
it implements many features of relational databases, such as sorting,
secondary indexing, range queries and nested document querying.
Operators such as create, insert, read, update and remove, as well
as manual indexing, indexing on embedded documents and indexing of
location-based data, are also supported. In such systems, data are
stored as documents grouped into collections; a document consists of
entities that provide some structure and encoding of the managed data.
A collection is similar to a table in relational databases and is
schema-free, which means that documents with different data structures
can be stored in the same collection [9]. Each document constitutes an
associative array of scalar values, lists or nested arrays and has a
unique special key "_id", a 12-byte value (usually rendered as a
24-character hexadecimal string) computed from the timestamp, a
host identifier, a process identifier (PID) and a counter, and used to
identify the document explicitly. This field and the actual document
are conceptually similar to a key-value pair. A unique index is created
on the "_id" field by default [10].

MongoDB documents are serialized naturally as JavaScript Object
Notation (JSON) objects and stored internally using a binary
encoding of JSON called BSON. The maximum size of a BSON document is
limited to 16 MB. Like all NoSQL systems, MongoDB imposes
no schema restrictions and can support semi-structured data and
multi-attribute lookups on records which may have different kinds
of key-value pairs [11]. In general, documents are semi-structured
files like XML, JSON, YAML and CSV. There are two ways of storing data
in MongoDB: a) nesting documents inside each other, an
option that can work for one-to-one or one-to-many relationships,
and b) referencing documents rather than nesting the entire document,
an option in which the referenced document is retrieved only when the
user requests data inside it [12].


Neo4j is a graph based datastore. It does not provide a standard
SQL interface but accepts direct REST requests. In these systems, a
graph representation is used, which can address scalability concerns.
Graph structures are composed of edges, nodes and properties
and provide index-free adjacency. Nodes and edges consist of
objects with embedded key-value pairs. Graph databases specialize
in the efficient management of heavily linked data and are
optimized for highly connected data. In such systems, cost-intensive
operations like recursive joins can be replaced by efficient graph
traversal and graph pattern matching techniques. In the case of graph
traversal, query processing starts from one node and the
other nodes are then traversed according to the query description,
while in graph pattern matching the defined pattern is located
in the original graph. Neo4j keeps, in each vertex and edge, a
mini-index of the objects connected to it, which typically
means that the size of the graph has no performance impact on a
traversal and that the cost of a local step (hop) remains the same.
In addition, a global adjacency index is used to locate the starting
point of a traversal. Indexes provide a fast and efficient way to
retrieve vertices based on their values [13].

On the other hand, PostgreSQL belongs to the category of traditional
relational database management systems (RDBMS) and it is
widely adopted in industrial and research settings. PostgreSQL is an
open source object relational database system (ORDBMS) that uses
and extends the SQL language [14]. It allows several well-known
operations such as inserts, updates and deletes, as well as queries on
the data stored in the database [4]. A fundamental characteristic of
the PostgreSQL database is the support of user-defined objects,
including data types, functions, operators, domains and indexes. It
supports multiple operators for querying, filtering, joining, grouping
and modifying data.

Some characteristics of such systems that have a significant impact
on the scalability of the data stores are the data model, query
model, partitioning, consistency and replication. The query model
refers to the data retrieval commands and query languages that are used
to retrieve the data stored in the database. A commonly employed
strategy for storing and processing massive datasets is to partition
the data across different server nodes, thus achieving high
availability and fault tolerance. Replication relates to the
dependability of database systems and refers to the process by which
the same data are stored on multiple servers so that read and write
operations can be distributed over them. Replication also provides
fault tolerance because data availability can withstand the failure of
one or more servers. Strongly related to replication is the consistency
Table 1: Summary of key characteristics

Redis
  Data Model:    Key-value store
  Query Model:   Data retrieval commands with no queries or query
                 planner abstractions in the middle
  Partitioning:  Range partitioning, Hash partitioning, Consistent
                 hashing
  Consistency:   Eventual consistency
  Replication:   Master-slave asynchronous replication

HBase
  Data Model:    Column family store
  Query Model:   Shell like command query. REST and Thrift API are
                 supported
  Partitioning:  Tables are partitioned by row-key into regions stored
                 in different region server
  Consistency:   Strong consistency as each record must be updated on
                 assigned region server and replication committed
                 before read
  Replication:   HDFS to store replication with selectable factors

MongoDB
  Data Model:    Document store
  Query Model:   Queries as BSON objects sent to MongoDB driver
  Partitioning:  Range partitioning based on a shard key
  Consistency:   Immediate consistency
  Replication:   Master-slave asynchronous replication

Neo4j
  Data Model:    Graph store
  Query Model:   Cypher query language match patterns of nodes and
                 relationships in the graph
  Partitioning:  Cache-based
  Consistency:   Eventual consistency
  Replication:   Master-slave

PostgreSQL
  Data Model:    ORDBMS
  Query Model:   Utilizes the SQL querying language
  Partitioning:  Range partitioning, List partitioning
  Consistency:   Eventual consistency (asynchronous write), Strong
                 consistency (synchronous write) (Serializable
                 Transactions, Explicit Blocking Locks)
  Replication:   Streaming replication, Synchronous replication


level provided by the data store. Consistency is a system property 
that ensures that a transaction brings the database from one valid 
state to another. The consistency models are: strong, eventual or 
immediate consistency. Strong or immediate consistency ensures 
that when write requests are confirmed, the same (updated) data 
are visible to all subsequent read requests. In eventual consistency, 
changes eventually propagate through the system given sufficient 
time and therefore some server nodes may contain outdated data 
for a period of time. In general, in distributed systems there is 
a trade-off between consistency and availability of data. Table 1 
presents the examined systems classified based on the criteria of 
data model, query model, partitioning, consistency and replication. 
These criteria have a significant impact on the scalability of the 
data stores. 


4 SPATIO-TEMPORAL FUNCTIONALITY 


This section presents the spatio-temporal functionalities and
the geospatial capabilities of the examined database systems.
The most prominent differences concerning geospatial support
relate to indexing, geometry types and query operators.

Redis can efficiently handle and support geospatial data with 
the use of geospatial set and operations that can handle location- 
specific indexing, searching, updating and sorting in a simple way. 
With the combination of the built-in functions and data types, 
the infrastructure may provide reduced code complexity, reduced 
network bandwidth consumption and overall faster execution [15]. 
For geospatial indexing, Redis uses the Geo Set. The Geo Set is a
data structure implemented similarly to another data structure called
Sorted Set (a basic data structure of Redis) and it is the basis for
working with geospatial data. Each Geo Set member includes a unique
identifier and a coordinate pair (longitude, latitude). The functions
used for geospatial index management are creating, adding, updating,
removing, deleting, reading and searching the index through
a list of geospatial commands (GEOADD, GEOPOS, GEOHASH,
GEODIST etc.). Redis Geo Set allows storing and querying various
geometry types such as: Point, Polygon, MultiPolygon, MultiPoint,
LineString, MultiLineString and GeometryCollection.

As mentioned above, the Geo Set data structure is similar to a
Sorted Set. A Sorted Set is a mix between a Set and a Hash. Like
a Set, it contains unique string elements and, like a Hash, every
element is associated with a value, here a floating point value called
the score. The elements inside a Sorted Set are sorted by their score,
which is a 64-bit floating point number, allowing ordering
and searching for members by their rank or score. The main difference
between a Sorted Set and a Geo Set is that the score in the latter is
used to store the location. A location is a coordinate pair,
longitude and latitude. The main functionality that the Geo Set
provides is the encoding and decoding between such coordinate pairs and
numerical scores. For translating between these two representations
Redis implements the Geohash system.

The Geohash algorithm is a latitude/longitude geocode system that
is used for encoding and decoding coordinate pairs in a compact form
and divides geographic regions into a hierarchical structure [16].
Coordinates are converted into a string using a base-32 character
map. A Geohash string represents a spatial bounding box, thus Geohash
divides geographic space into buckets of grid shape [17]. Redis
uses Geohash to map coordinate pairs to their hash values and
stores in each member’s score the hash’s numerical representation.
The Geohash system divides the world into rectangular cells, where
each cell is uniquely identified by its hash value. The cells’
hashes are computed by interleaving the information from both
location coordinates into a single value. At first, the algorithm takes
a coordinate and tests whether the longitude is to the left or to the
right of the Prime Meridian. If it is to the right, the hash’s most
significant bit is set to 1; otherwise it is set to 0. In the next step
the same logic is applied to the latitude. If it lies north of the
Equator, the hash’s next bit is set to 1, leading to the binary value
11. This process continues by dividing that hash’s cell according to
the longitude again; by repeating the Geohash algorithm, the resulting
binary value represents a location with an increasing degree of
accuracy. The accuracy increases with the number of iterations that are
performed. Redis allows at most 26 Geohash iterations, which produce
52-bit long hash values, which in turn provide an accuracy error of
about 0.6 m. Figure 3 illustrates the Geohash algorithm with the
logic behind the bit representation [15].
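
The bit interleaving described above can be sketched in a few lines.
This is a simplified illustration, not the Redis implementation: it
assumes the full ranges [-180, 180] for longitude and [-90, 90] for
latitude (Redis restricts the latitude range), and the function names
and the equator length constant are choices made for this example.

```python
def geohash_bits(lon, lat, steps=26):
    """Interleave `steps` longitude bits and `steps` latitude bits, longitude first."""
    lon_lo, lon_hi = -180.0, 180.0
    lat_lo, lat_hi = -90.0, 90.0
    value = 0
    for _ in range(steps):
        # Longitude bit: 1 if the point lies in the eastern half of the cell.
        mid = (lon_lo + lon_hi) / 2
        if lon >= mid:
            value = (value << 1) | 1
            lon_lo = mid
        else:
            value = value << 1
            lon_hi = mid
        # Latitude bit: 1 if the point lies in the northern half of the cell.
        mid = (lat_lo + lat_hi) / 2
        if lat >= mid:
            value = (value << 1) | 1
            lat_lo = mid
        else:
            value = value << 1
            lat_hi = mid
    return value


def geohash_cell_center(value, steps=26):
    """Decode an interleaved value back to the center of its cell."""
    lon_lo, lon_hi = -180.0, 180.0
    lat_lo, lat_hi = -90.0, 90.0
    for i in range(steps):
        lon_bit = (value >> (2 * (steps - i) - 1)) & 1
        lat_bit = (value >> (2 * (steps - i) - 2)) & 1
        mid = (lon_lo + lon_hi) / 2
        lon_lo, lon_hi = (mid, lon_hi) if lon_bit else (lon_lo, mid)
        mid = (lat_lo + lat_hi) / 2
        lat_lo, lat_hi = (mid, lat_hi) if lat_bit else (lat_lo, mid)
    return (lon_lo + lon_hi) / 2, (lat_lo + lat_hi) / 2


# A point east of the Prime Meridian and north of the Equator starts with bits 11.
first_bits = geohash_bits(23.7, 37.9, steps=1)   # 0b11
score = geohash_bits(23.7, 37.9)                 # fits in 52 bits

# East-west width of a cell at the Equator after 26 longitude splits.
EQUATOR_LENGTH_M = 40_075_017  # approximate circumference of the Earth
cell_width_m = EQUATOR_LENGTH_M / 2**26
```

With 26 splits per coordinate the cell width at the Equator is roughly
0.6 m, which matches the accuracy quoted above, and a 52-bit integer
can be stored without loss in a 64-bit floating point score.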

Geospatial indexing through Geohash supports only two spatial
dimensions, with no regard to altitude. To manage 3-dimensional
geospatial data, Redis uses a combination of two well-known data
structures, through an xyzset module: a Geo Set that is used for
storing coordinate pairs and a Sorted Set that stores the elevation of
each member in its score.




Figure 3: Geohash algorithm 


HBase does not provide built-in spatio-temporal querying
capabilities. For spatial functionality, a scalable data storage
solution called HBaseSpatial exists, as shown in [18]. In general, the
system provides efficient and effective distributed storage and vector
data indexing and can support large scale spatial vector data storage
and management. HBaseSpatial is divided into two main parts, the
storage model and the index model. The storage model receives
the vector data from shapefiles, passes these data to the index
model and finally converts them to the WKB type and stores them
in the HBase table. The index algorithm calculates the id of each
vector datum and puts it into the index table. Range searches are
made more efficient by the use of this secondary index method.
Vector spatial data include attribute and topology data and spatial
coordinates. Spatial attributes of such data contain a large number
of geometry coordinates and for this reason the WKB format is
used for storing the information in binary form. For indexing, the
grid spatial partition index method is used, where the global scope
of the latitude and longitude pair is divided into different levels of
the grid.

Another system that provides spatial functionality to HBase
is presented in [19]. The STEHIX (Spatio-TEmporal Hbase IndeX) index
structure is suitable for processing spatio-temporal queries and it is
a two-level lookup mechanism for HBase. At first, with the use of a
Hilbert curve, geolocation data are linearized and stored in a meta
table, and then for each region an index mechanism is used for the
storage files. Also, a system called MD-HBase [20] adds an index
structure to the meta table of HBase but does not provide an index to
look up the inner data of regions. Finally, a system called GeoMesa³
provides spatio-temporal indexing on top of many systems including
HBase. GeoMesa is an open source tool that enables large scale
geospatial querying and analytics on distributed computing systems.
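
The linearization of two-dimensional locations along a Hilbert curve
can be sketched as follows. The code uses the well-known iterative
cell-to-index conversion for the Hilbert curve and is not taken from
STEHIX; the function names and the mapping of longitude and latitude
onto grid cells are assumptions made for this illustration.

```python
def hilbert_index(order, x, y):
    """Position of cell (x, y) on the Hilbert curve of a 2**order by 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def cell_of(lon, lat, order):
    """Map a (longitude, latitude) pair to a grid cell (hypothetical gridding)."""
    n = 1 << order
    x = min(int((lon + 180.0) / 360.0 * n), n - 1)
    y = min(int((lat + 90.0) / 180.0 * n), n - 1)
    return x, y


# One-dimensional key for a location, usable as (part of) a sortable row key.
x, y = cell_of(-38.5267, -3.7319, order=16)
key = hilbert_index(16, x, y)
```

Cells that are consecutive on the curve are always adjacent in the
grid, so nearby locations tend to receive nearby keys, which is what
makes such a key useful for range scans over a sorted table.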

For spatial functionality in MongoDB, data are stored either as 
GeoJSON objects, a format for encoding a variety of geographical 
data structures, or as legacy coordinate pairs (MongoDB 
versions 2.2 and earlier). GeoJSON supports a) Geometry types 
such as Point, LineString, Polygon, MultiPoint, MultiLineString, 
MultiPolygon and GeometryCollection, b) Feature, which is a geometric 
object with additional properties, and c) FeatureCollection, which 
consists of a set of features [21]. Each GeoJSON document is composed 
of two fields: i) Type, the shape being represented, which informs 
a GeoJSON reader how to interpret the "coordinates" field, and ii) 
Coordinates, an array of points, the particular arrangement of 
which is determined by the "type" field. In MongoDB, the geographical 
representation needs to follow the GeoJSON format structure 
in order to be able to set a geospatial index on the geographic 
information. MongoDB uses BTree indexes (not R-trees) and offers 
several index types to support specific types of data and queries: 
Single Field, Compound Index, Multikey Index, Text Indexes, Hashed 
Indexes and Geospatial Index. To support efficient queries on geospatial 
coordinate data, MongoDB provides two special indexes: 2d indexes, 
which use planar geometry when returning results, and 2dsphere 
indexes, which use spherical geometry. A 2dsphere 
index supports queries that calculate geometries on an earth-like 
sphere and supports all MongoDB geospatial queries: queries for 
inclusion, intersection and proximity. 2d indexes support queries that 
calculate geometries on a two-dimensional plane, i.e. queries that 
interpret geometries on a flat surface, and some spherical queries, 
but do not support GeoJSON-formatted queries or GeoJSON data 
values. MongoDB also supports the Geo Haystack index, which is used 
to query small areas but is nowadays less used by applications. 
MongoDB computes the geohash values for the coordinate pairs 
and then indexes the geohash values. Concerning spatio-temporal 
functionality, MongoDB supports four geospatial query operators: 
$geoIntersects, $geoWithin, $near and $nearSphere. 
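
As an illustration of the GeoJSON layout and of two of the operators listed above, the sketch below builds a document and two query filters as plain Python dictionaries. It does not connect to a server; the field name `location`, the helper names and the sample coordinates are assumptions made for the example. With a driver such as PyMongo, filters of this shape would typically be passed to `find` on a collection that has a 2dsphere index on `location`.

```python
# GeoJSON stores coordinates as [longitude, latitude].
place = {
    "name": "sample point",  # illustrative document, values are approximate
    "location": {"type": "Point", "coordinates": [23.7257, 37.9715]},
}


def geo_within_polygon(field: str, ring: list) -> dict:
    """Build a $geoWithin filter for a polygon with one closed linear ring."""
    if ring[0] != ring[-1]:
        raise ValueError("a GeoJSON polygon ring must be closed")
    geometry = {"type": "Polygon", "coordinates": [ring]}
    return {field: {"$geoWithin": {"$geometry": geometry}}}


def near_point(field: str, lon: float, lat: float, max_distance_m: float) -> dict:
    """Build a $near filter around a GeoJSON point (distance in meters)."""
    geometry = {"type": "Point", "coordinates": [lon, lat]}
    return {field: {"$near": {"$geometry": geometry, "$maxDistance": max_distance_m}}}
```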

Table 2: Summary of spatio-temporal key characteristics 

Database System | Geometry Types | Geospatial Indexing | Spatial query operators-functions 
Redis | Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, GeometryCollection | Geo Set | Geoadd, Geopos, Geohash, Geodist, Geopathlen, Georadius, Georadiusbymember, Geoencode, Geodecode, Geometry filter 
HBaseSpatial | Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, SimpleFeatureType, GeometryCollection | Grid spatial index method | Range queries of vector spatial data, k-NN queries 
MongoDB | Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, GeometryCollection, Feature (geometric object with additional properties), FeatureCollection (a set of features) | 2dsphere, 2d | $geoIntersects, $geoWithin, $near, $nearSphere 
Neo4j | Geometry, Point, LineString, Polygon, MultiPoint, MultiLinestring, MultiPolygon | RTree | Contain, Cover, Covered By, Cross, Disjoint, Intersect, Intersect Window, Overlap, Touch, Within, Within Distance, Area, BBox, Boundary, Distance, Buffer, Centroid, ConvexHull, Envelope 
PostgreSQL | Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, GeometryCollection | Generalized Search Tree (GiST) | ST_Within, ST_Intersects, ST_DWithin + Order by distance, ST_Area ... 

³ GeoMesa, https://github.com/locationtech/geomesa 
⁴ Neo4j Spatial, https://github.com/neo4j-contrib/spatial 

In Neo4j, a plugin called Neo4j-Spatial⁴ supports various 
geometry types such as Geometry, Point, LineString, Polygon, 
MultiPoint, MultiLinestring and MultiPolygon. The implemented spatial 
queries include the following topological functions: Contain, 
Cover, Covered By, Cross, Disjoint, Intersect, Intersect Window, 
Overlap, Touch, Within and Within Distance. Moreover, the 
analysis functions provided are Area, BBox, Boundary, Distance, 
Buffer, Centroid, ConvexHull and Envelope, and the set functions 
include the Difference, Intersection, Union and SymDifference 
methods. Neo4j-Spatial can import data in both ESRI Shapefile (SHP) 
and Open Street Map (OSM) formats. Each format provides different 
layers, which in turn support different geometry types. A single 
layer can be divided into multiple sub-layers through the use of 
pre-configured filters, which can prove efficient when working 
with large datasets. In addition, each spatial data layer has its own 
configuration of the coordinate system obtained from the input 
files (SHP or OSM) [22]. Concerning indexing, Neo4j-Spatial uses 
an RTree for spatial queries, which is suitable for 2-dimensional and 
3-dimensional spatial data. Typically, with the use of an RTree, every 
geometry is grouped and represented with its minimum bounding 
rectangle in the next-higher level of the tree. For graph traversal, 
where a node or a relationship needs to be found based on a 
property, a spatial lookup is performed for increased performance. 
The system uses an R-tree index structure only to retrieve the start 
elements and from that point onwards an index-free traversal is 
executed through the graph [23]. 

In PostgreSQL, there is a special extension called PostGIS that 
integrates several geofunctions and supports geographic objects. 
PostGIS contains more than one thousand geofunctions [21] which, 
according to [4], can be divided into five categories: management, 
conversion, retrieval, comparison and generation. In general, PostGIS 
provides spatial services such as spatial objects, spatial indexes, 
spatial operators and spatial manipulation functions [24]. It 
supports several geometry types such as Points, LineStrings, Polygons, 
MultiPoints, MultiLineStrings, MultiPolygons and GeometryCollections. 
In general, the PostGIS implementation is based on "light-weight" 
geometries and the indexes are optimized to reduce disk and 
memory footprint. 

Concerning indexing, PostgreSQL supports several types of 
indexes such as BTree, RTree, Hash, Generalized Inverted Indexes 
(GIN) and Generalized Search Tree (GiST), the latter in a form called 
R-tree-over-GiST. BTree is the default index type, used for 
one-dimensional ordered data, and can be used efficiently for equality 
and range queries with all datatypes. With RTree indexing, data are 
divided into rectangles; this index is suitable for two-dimensional 
spatial data. For general balanced tree structures and high-speed spatial 
querying, PostgreSQL uses GiST indexes, which can be used to index 
the geometric data types. GiST stands for "Generalized Search Tree" 
and is suitable for speeding up search queries on all kinds of 
irregular data structures. BTree cannot accomplish this, and GiST has 
two advantages over RTree and BTree. First, GiST can be used to index 
columns with null values. Second, it supports the concept of 
"lossiness", as mentioned in [25]: for a large GIS object, only the 
significant part of the object, its bounding box, is stored in the 
index. Moreover, GIS objects larger than 8K cause RTree indexes to fail. 
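
The bounding-box idea behind both the R-tree grouping and the lossy GiST entries can be sketched in a few lines. The code below is a generic illustration, not PostGIS or Neo4j-Spatial code; the helper names and the sample triangle are assumptions. It shows that a bounding-box test is only a pre-filter: a query window may intersect the box without intersecting the geometry itself, so an exact check is still needed afterwards.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def bounding_box(points: Iterable[Point]) -> BBox:
    """Minimum bounding rectangle of a set of vertices."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_intersect(a: BBox, b: BBox) -> bool:
    """True if the two axis-aligned rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# A right triangle whose hypotenuse is the line x + y = 4.
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
box = bounding_box(triangle)

# This window overlaps the box but lies entirely beyond the hypotenuse,
# so an index built on boxes alone reports a candidate that the exact
# geometry test must later reject.
window = (3.0, 3.0, 5.0, 5.0)
candidate = boxes_intersect(box, window)
```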

Table 2 presents the examined systems classified based on the 
spatio-temporal criteria of geometry types, geospatial indexing and 
spatial query operators. 


5 CONCLUSIONS 


As more applications depend on data of a spatio-temporal 
nature, the DBMS community will continue to seek more efficient 
ways to support them. Redis, HBase, MongoDB, Neo4j and PostgreSQL 
are all representative cases of DBMSs that provide geospatial 
querying capabilities. Each of those systems is based on a different 
underlying technology and model, resulting in varying geospatial 
indexing methods, geometry types and spatial query operators. 
ABSTRACT 


The introduction of flash SSDs has accelerated the performance of 
DBMSes. However, the intrinsic characteristics of flash motivated 
many researchers to investigate new efficient data structures. The 
emergence of 3DXPoint, a new non-volatile memory, sets new 
challenges: 3DXPoint features low latency and high IOPS even at 
small queue depths. However, the cost of 3DXPoint is 4 times higher 
than that of a flash-based device, rendering hybrid storage systems 
a good alternative. In this paper we exploit the efficiency 
of both 3DXPoint and flash-based devices by introducing H-Grid, a 
variant of Grid-File for hybrid storage. H-Grid uses a flash SSD as 
main store and a small 3DXPoint device to persist the hottest data. 
The performance of the proposed index is experimentally evaluated 
against GFFM, a flash efficient implementation of Grid 
File. The results show that H-Grid is faster than GFFM running 
on a flash SSD, reducing the single point search time by 35% up 
to 43%. 
1 INTRODUCTION 


The emergence of non-volatile memories (NVM) has enabled new 
storage devices with remarkable features such as ultra-high read and 
write speeds, small size, low power consumption and shock resistance. 
Nowadays, flash-based solid state drives (SSDs) are found in the vast 
majority of consumer computer systems, as well as in almost every 
data center. Data intensive applications, like DBMSes, have drawn 
significant performance advantages from this evolution. As a result, 
index structures for flash SSDs have become a promising field of 
study for many researchers. Most of the presented works concern 
tree indexes for one- [1, 11, 16, 25, 26] and multi-dimensional data 
[3, 10, 14, 28, 30], while fewer investigate flash efficient 
hashing methods [5, 17]. Briefly, the majority of proposals aim at 
reducing the number of small random writes that deteriorate the 
performance of SSDs, while avoiding the mingling of reads and 
writes for the same reason. On the other hand, they seek to exploit 
the high internal parallelism of modern devices. Thus, some well 
known techniques employed to meet these objectives are: 
i) postponing write operations and performing them in batches, 
ii) buffering of retrieved read pages, iii) applying logging and iv) 
grouping of page read requests. 

A new class of SSDs was introduced by Intel, under the brand name 
“Optane”, in 2017. These storage devices are based on 3DXPoint 
non-volatile memory technology. 3DXPoint uses a layered 
crosspoint architecture, permitting individual addressing of each 
memory cell. Unlike flash, it supports in-place writes, relieving 
the SSD controller from the burden of maintaining out-of-place 
updates and garbage collection operations. It provides up to 10³ 
times better access times compared to NAND flash, while its density 
is 10 times higher than that of DRAM [22]. Therefore, [8] proposed two 
more possible uses of 3DXPoint, other than as secondary storage: i) 
as a low cost extension of DRAM, and ii) as persistent main memory 
directly accessed by the CPU. 

The efficiency of a storage device is described by three 
performance metrics: IOPS, bandwidth and latency. IOPS is the 
number of I/O operations that the device is able to carry out per 
unit of time. Bandwidth expresses the 
throughput that a drive can deliver, measured in MB/s. Finally, 
latency is the amount of time that an I/O request takes to complete, 
i.e., the response time of an operation. Latency is of paramount 
importance for the efficiency of a storage system, since low latency 
is tightly connected with better user experience. Little's law [18] 
for storage systems mandates that 

\[ \mathrm{IOPS} = \frac{\mathrm{Queue}}{\mathrm{Latency}}, \]

where Queue is the number of outstanding requests, i.e. the number 
of I/O requests sent to the device in parallel. It is clear that 
reducing latency retains IOPS efficiency even with less concurrent 
I/O. Lower latency values enable workloads to finish in a fraction 
of the initial time. According to [32], the latency of high-performance 
NVMe SSDs contributes over 19% of the overall response time of online 
applications. New SSD devices providing ultra-low latency have been 
introduced lately; such devices are Intel's Optane series and 
Samsung's Z-NAND. Intel Optane SSDs are based on 3DXPoint 
non-volatile memory and can provide a latency reduction of one 
order of magnitude compared to conventional NAND flash SSDs. 
3DXPoint SSDs can deliver high IOPS even when a small number 
of concurrent outstanding I/Os is used (small queue depth), while 
their NAND counterparts are more efficient under large batched 
I/O [8, 13]. 
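
The relation between queue depth, latency and IOPS can be checked with a small calculation. The sketch below is only an illustration of Little's law as stated above; the function names and the latency figures (10 and 100 microseconds) are arbitrary assumptions, not measurements of any device.

```python
def iops(queue_depth: float, latency_s: float) -> float:
    """Little's law for storage: IOPS = Queue / Latency."""
    return queue_depth / latency_s


def queue_depth_needed(target_iops: float, latency_s: float) -> float:
    """Outstanding requests required to sustain target_iops."""
    return target_iops * latency_s


# Hypothetical devices: one answers in 10 microseconds, one in 100.
low_latency = 10e-6
high_latency = 100e-6

# At queue depth 1 the low-latency device already reaches about 100K IOPS;
# the slower one needs about ten outstanding requests for the same rate.
fast_qd1 = iops(1, low_latency)
slow_qd_for_same_rate = queue_depth_needed(fast_qd1, high_latency)
```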

Previous works on flash efficient database indexes focus on 
exploiting the high internal parallelism of SSD devices by issuing 
multiple read or write operations at once. Several works utilize large 
queue depths, seeking to distribute the workload among multiple 
NAND chips and thus accelerate query performance [27]. Although this 
technique has proved very useful so far, the low latency of new 
NVM technologies can further improve performance, especially 
where limited opportunity for grouping I/O requests exists. To the 
best of our knowledge, this is the first time that the low latency 
3DXPoint NVM is exploited to accelerate the performance of a 
spatial index. 

The contributions of this paper can be summarized as follows: 


• We introduce a new spatial index structure, the H-Grid (Hybrid 
Grid-File), which is designed for hybrid storage. Particularly, 
we consider flash SSDs as the mass storage tier and 
3DXPoint ones as the performance tier. 

• We present a hot region detection algorithm that recognizes 
regions of high interest, storing them to the performance 
tier. 

• We evaluate our H-Grid through extensive experimentation, 
utilizing one real and two synthetic datasets. We study single 
point search, region and kNN queries. 


The remainder of this paper is organized as follows. Section 2 
describes the related work in hybrid storage systems and hybrid 
indexes. The design and implementation details of H-Grid are un- 
folded in Section 3. Section 4 presents the experimental results and, 
finally, our conclusions are listed in Section 5. 


2 RELATED WORK 


Hybrid storages are not rare in database systems; several algorithms 
have been proposed so far [24]. Most of the related works, until now, 
consider flash-based solid state drives as the performance tier and 
magnetic disks as the storage tier. In fact, hybrid storage systems 
employ SSDs either as a cache between main memory and HDD or 
as high performing devices storing permanently the hottest data. 

In [2] a flash based SSD acts as an extension of the standard main 
memory bufferpool accommodating high priority data. The hot data 
regions are identified using frequency and recency statistics, while 
an aging mechanism ensures that the cached regions are in line 
with the I/O pattern, as it changes over time. The authors in [19] 
study different buffer management policies in relational DBMSes 
(i.e. MySQL), when a hybrid SSD/HDD scheme is used as persistent 
storage. Their findings indicate that the performance of hybrid 
systems, which employ SSDs for caching, is highly dependent on 
the ratio between SSD and HDD bandwidth. 

Hystor [4] is an extension to the Linux operating system that identifies 
hot and performance critical data blocks by monitoring the 
I/O sequence. This data is stored on a fast SSD instead of a mag- 
netic disk. Following a different roadmap, MOLAR [20] proposes 
the implementation of the hot page detection mechanism into the 
SSD's controller. Simulated experiments have shown that MOLAR 
can reduce the average write latency in SSDs by 3.5 times. 
The efficiency of hybrid storage systems is connected with the 
accuracy of hot data identification. [21] uses a probabilistic algo- 
rithm to locate hot data. The algorithm maintains two probabilities. 
The first probability contributes to the decision of which pages 
should be evicted from RAM and the second one determines the 
persistent storage (SSD/HDD) an evicted page should be moved to. 

Although many flash efficient database indexes have been pro- 
posed so far, there exist only a few hybrid ones. The HybridB tree 
[12] is a B+tree variant for hybrid SSD/HDD storage. It always keeps 
the internal nodes in the SSD, while it distributes the leaf node pages 
between HDD and SSD. Specifically, it adopts a huge-leaf organiza- 
tion for the leaf nodes, aiming to reduce costly splits and merges. 
A huge-leaf occupies two or more pages in the secondary storage 
and includes a special node for metadata and a logging part as 
well. The XB-Tree [15] is a hybrid index for PCM/RAM memory. 
PCM is utilized as non-volatile random access memory along with 
DRAM rather than as secondary storage. The proposed index 
distinguishes nodes according to their read/write tendency, retaining 
write intensive nodes in DRAM, while it stores the read intensive 
ones in PCM. In this way it reduces costly write operations in PCM, 
while simultaneously increasing the overall performance. 

A recent study [31] investigates the use of 3DXPoint technology 
to enhance the performance of database systems. Specifically, the 
authors recognize write amplification, careless use of temporary 
tables and bufferpool cache misses as factors that degrade query 
performance. Subsequently, they experimentally show that an 
enterprise class 3DXPoint SSD can improve query performance by 
1.1-6.5x compared to a flash counterpart. 


3 THE H-GRID 


Spatial data structures are of paramount importance for spatial 
query processing. They represent simple or complex spatial objects 
(e.g. points, lines, polygons, etc.) in a manner that simplifies execution 
of spatial queries [29]. In our previous work [5, 6] we utilized 
flash SSDs to enhance the efficiency of Grid File [23]. In this paper, 
our objective is to take advantage of a new non-volatile memory 
technology, the 3DXPoint. Therefore, we introduce the H-Grid, a 
Grid File variant for hybrid storage. 


3.1 H-Grid Design 


A common method in past research on flash efficient database indexes 
is to group I/O operations, exploiting the high bandwidth, the 
internal parallelism of modern SSDs and the efficiency of NVMe pro- 
tocol. This strategy provides sufficient results, especially in range 
and kNN queries as they usually involve access to multiple pages. 
Furthermore, some authors propose grouping of incoming single 
key search requests into sets that are processed simultaneously [27]. 
However, in all aforementioned cases, accessing the upper level 
nodes of tree-indexes does not always exploit the full bandwidth of 
SSDs, even if multiple nodes are fetched with a single I/O request. 
The performance characteristics of 3DXPoint, i.e. its efficiency even 
at small size I/O, motivated us to introduce H-Grid. H-Grid is a 
Grid-File variant designed for hybrid, 3DXPoint/flash storage. It 
exploits a frequency based model for data placement. It detects per- 
formance critical regions placing them to the low latency 3DXPoint 
storage, while it leaves the rest of them to the flash SSD. To the best 
of our knowledge, H-Grid is the first attempt to introduce a spatial 
index that exploits hybrid 3DXPoint/flash I/O. 

Figure 1: Overview of H-Grid. A part of the Grid-File is 
migrated to the 3DXPoint storage. 

A running example of H-Grid is illustrated in Figure 1. The H- 
Grid implementation follows the two-level Grid File design as it is 
presented in [9]. Thus, H-Grid employs a small, memory resident 
Root Directory (Fig. 1a) and many sub-directories that reside in 
the physical storage. The sub-directories hold the addresses of data 
buckets that contain the actual data. The sub-directories and the 
data buckets can reside in either a flash (Fig. 1b), or a 3DXPoint SSD 
(Fig. 1c). A selection algorithm locates frequently accessed regions 
that are eligible for the 3DXPoint storage, considering weight values 
for each retrieved directory or data bucket page. These weights 
are calculated using the access frequencies of the pages. We use 
two hashing tables (one for directory and one for data pages) to 
associate each page in the 3DXPoint storage with its corresponding 
weight value. 

H-Grid leverages in-memory buffers to accommodate pages that 
are either retrieved from the SSDs or temporarily stored prior to a 
batch write operation [5, 6]. It employs separate buffers for 
sub-directories and data buckets, enabling different buffering policies 
that depend on the page type (directory/data). At the moment, we 
utilize LRU as the eviction policy in both buffers. Dirty evicted pages 
are not persisted immediately; they are accumulated into write 
buffers instead, enabling batch writes that accelerate performance. 
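
The buffering scheme described above can be sketched as an LRU buffer whose dirty victims are collected and written in batches. This is a simplified illustration under stated assumptions, not the H-Grid implementation: the class name `LRUPageBuffer`, the `batch_size` threshold and the `flush` callback are hypothetical, and a single write buffer is used instead of one per device.

```python
from collections import OrderedDict


class LRUPageBuffer:
    """LRU page buffer that defers writes of dirty evicted pages."""

    def __init__(self, capacity, batch_size, flush):
        self.capacity = capacity
        self.batch_size = batch_size
        self.flush = flush          # callback receiving a list of (page_id, page)
        self.pages = OrderedDict()  # page_id -> (page, dirty); last item is MRU
        self.write_buffer = []      # dirty victims waiting for a batch write

    def get(self, page_id):
        if page_id not in self.pages:
            return None
        self.pages.move_to_end(page_id)  # promote to the MRU position
        return self.pages[page_id][0]

    def put(self, page_id, page, dirty=False):
        if page_id in self.pages:
            dirty = dirty or self.pages[page_id][1]
        self.pages[page_id] = (page, dirty)
        self.pages.move_to_end(page_id)
        while len(self.pages) > self.capacity:
            victim_id, (victim, was_dirty) = self.pages.popitem(last=False)
            if was_dirty:
                self.write_buffer.append((victim_id, victim))
                if len(self.write_buffer) >= self.batch_size:
                    self.flush(self.write_buffer)  # one batched write
                    self.write_buffer = []
```

Clean victims are simply dropped, while dirty ones reach storage only when enough of them have accumulated, which trades a small delay for fewer, larger writes.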

We also examine a special case of H-Grid (Fig. 2), where all sub- 
directories are placed to the 3DXPoint storage, along with a number 
of selected data buckets. This approach can provide additional 
performance gain, since the sub-directories are referenced more 
frequently and their access pattern usually involves small-size I/O. 
The induced space overhead is not prohibitive since, as experimental 
results indicate, the size of directory pages in the physical storage 
is two orders of magnitude less than that of data buckets. The 
algorithms in the rest of the document were modified accordingly 
to comply with this special case. 

In what follows, we describe the hybrid bucket detection algorithm 
and the Search/Insert operations in H-Grid. 

Figure 2: H-Grid special case. All sub-directories are hosted on 
the 3DXPoint SSD. 


3.2 Hot Region Detection Algorithm 


The role of the hybrid bucket detection algorithm is to reveal the 
most important, from performance viewpoint, sub-directories and 
data buckets. Only these will be migrated to the 3DXPoint storage. 
We use a temperature-based model to identify hot spatial regions 
that impose the highest I/O cost. These regions are represented 
by a number of sub-directories and data buckets. The weight of 
a sub-directory is highly correlated with the number of previous 
requests for it. 

Equation 1 provides a metric for the weight $W_i^S$ of a certain 
sub-directory $i$. 

[Equation (1): the weight $W_i^S$, combining a normalized access-frequency term with an aging term; the formula is illegible in the source.] 

The first term expresses the frequency of accesses to the specific 
sub-directory, normalized into the range [0,1]. The second term 
refers to an aging policy, providing an advantage to sub-directories 
that were recently accessed. $T$ is the current timestamp, while $t_i$ 
is the timestamp of the previous access of sub-directory $i$. In other 
words, the second term reflects the changes occurring in the access 
patterns over time. 

Regarding data buckets, we use a similar policy, as expressed by 
Eq. 2. 

[Equation (2): the weight of bucket $j$ as a function of $W_i^S$, $F_j$ and the aging factor; the formula is illegible in the source.] 

Specifically, we utilize the number of read requests $F_j$ for the bucket 
$j$ and the weight $W_i^S$ of its parent sub-directory $i$ to determine its 
eligibility. The aging factor is also applied to decay the weight 
of buckets that are rarely used. An additional condition for the 
data buckets is the presence of their parent sub-directory in the 
3DXPoint storage as well. 

The selection algorithm (Alg. 1) uses the weight values to 
identify the hottest buckets. Only these are migrated to the 3DXPoint 
storage. The algorithm initially calculates the weight of a bucket 
(lines 1-3). Subsequently, it uses the cumulative moving average 
(CMA) of the weights (line 9) to determine the bucket’s eligibility 
for the 3DXPoint storage. The simple moving average (SMA) in 
sequential time windows can be used alternatively. Once a hot bucket 
is detected, a dirty flag is set, forcing the bucket to be written to 
the 3DXPoint SSD during the next write-buffer flush (line 13). The 
parameter s is a tunable constant which controls the selectivity of 
the algorithm. HBT (Hybrid Bucket Table) is a hash table that maps 
all buckets in the 3DXPoint storage to their respective weights. 
The weight value of a bucket in the HBT is updated every time the 
bucket is retrieved. This algorithm is adapted for the sub-directories 
as well. 

Algorithm 1: HybridBucketDetect(B, S, WS) 
Data: the bucket B, the parent sub-directory S, the weight of 
the parent sub-directory WS 
Result: Bucket is set hybrid or not 
1 F ← getBucketStats(B.id); 
2 D ← (T − B.t)/T; 
3 W ← WS + F · D; 
4 Wsum ← Wsum + W; 
5 ++n; 
6 if S not in 3DXPoint then 
7     return 0; 
8 end 
9 CA ← Wsum/n; 
10 if W > s · CA then 
11     B.setHybrid ← 1; 
12     HBT[B.id] ← W; 
13     set the 3DXPoint dirty flag of B; 
14     return 1; 
15 end 
16 return 0; 
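
The thresholding step of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the class name `HotBucketDetector` and its method are hypothetical, and the bucket weight is taken as an input computed elsewhere, so the sketch covers only the cumulative moving average test, the selectivity parameter s and the HBT update.

```python
class HotBucketDetector:
    """Marks a bucket as hot when its weight exceeds s times the
    cumulative moving average (CMA) of all weights seen so far."""

    def __init__(self, s: float):
        self.s = s        # selectivity: larger values select fewer buckets
        self.w_sum = 0.0  # running sum of observed weights
        self.n = 0        # number of observed weights
        self.hbt = {}     # hybrid bucket table: bucket id -> weight

    def observe(self, bucket_id, weight: float, parent_in_fast_tier: bool) -> bool:
        # The statistics are updated for every bucket, eligible or not.
        self.w_sum += weight
        self.n += 1
        # A bucket qualifies only if its parent sub-directory is already
        # in the fast tier.
        if not parent_in_fast_tier:
            return False
        cma = self.w_sum / self.n
        if weight > self.s * cma:
            self.hbt[bucket_id] = weight
            return True
        return False
```

With s = 1.5, a bucket is selected only when its weight is at least 50% above the running average, so raising s shrinks the set of buckets placed in the fast tier.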


3.3 Single Point Search 


In the two-level Grid-File, the search operation starts from the 
in-memory root directory by locating the sub-directory which con- 
tains a particular point. When the sub-directory is retrieved, the 
procedure continues at the sub-directory level, looking for the 
appropriate bucket. In this way, the Grid-File guarantees that a single 
point is reached in two disk accesses. 

In H-Grid the search operation is adjusted to the hybrid storage 
configuration. Algorithm 2 describes the operation for a given 
point p at the sub-directory level; it is adapted similarly at the root 
level. Initially, the linear scales and the grid are used to find 
the address of the bucket B that contains p. A fetch operation for B 
is issued either to the flash or to the 3DXPoint storage (line 3). If B 
already resides in the 3DXPoint, its weight value is updated. 
Otherwise, Algorithm 1 is employed to decide if B is eligible for 
migration (lines 4-8). Finally, the last access timestamp of B is 
updated and B is returned. 

Algorithm 3 details the bucket fetching operation in H-Grid. If 
the requested bucket B is already in the in-memory buffer (MB) 
or in one of the two write buffers (flash or 3DXPoint), then B is 
moved to the most recently used (MRU) position of MB. Otherwise, 
a fetch operation from the secondary storage is initiated. The HBT 
is examined and a bucket read request is issued to the 
appropriate storage device. By the end of the operation, B is placed at the 
MRU position of the main buffer and a reference to it is returned. 

Algorithm 2: Search(p, S, WS) 
Data: the search point p, the parent sub-directory S, the 
weight WS of the parent sub-directory 
Result: the bucket B wherein p is located 
1 search the scales to convert the coordinates of p into interval indexes; 
2 use interval indexes to locate bucket B in the sub-directory; 
3 FetchBucket(B); 
4 if HBT[B.id] is not NULL then 
5     update HBT[B.id] with new weight value; 
6 else 
7     HybridBucketDetect(B, S, WS); 
8 end 
9 update B timestamp; 
10 return B; 

Algorithm 3: FetchBucket(B, HBT, MB) 
Data: the id of bucket B to be read, the Hybrid Bucket Table 
HBT, in-memory buffer MB 
Result: the bucket B 
1 if B is in main memory buffer MB then 
2     move B to the MRU position of main buffer; 
3 else if B is in flash SSD write buffer then 
4     move B to the MRU position of main buffer; 
5 else if B is in 3DXPoint SSD write buffer then 
6     move B to the MRU position of main buffer; 
7 else if HBT[B.id] is not NULL then 
8     read B from 3DXPoint SSD; 
9     move B to the MRU position of main buffer; 
10 else 
11     read B from flash SSD; 
12     move B to the MRU position of main buffer; 
13 end 
14 return B; 
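
The lookup order of the fetch operation can be sketched in a few lines of Python. The sketch assumes plain dictionaries for the main buffer, the two write buffers and the HBT, and two hypothetical reader callbacks (`read_xpoint`, `read_flash`) standing in for device I/O; none of these names come from the paper, and buffer capacity and eviction are left out.

```python
def fetch_bucket(bucket_id, main_buffer, flash_wb, xpoint_wb, hbt,
                 read_xpoint, read_flash):
    """Return the bucket, consulting memory first and the devices last."""
    if bucket_id in main_buffer:
        bucket = main_buffer.pop(bucket_id)  # re-inserted below as MRU
    elif bucket_id in flash_wb:
        bucket = flash_wb[bucket_id]
    elif bucket_id in xpoint_wb:
        bucket = xpoint_wb[bucket_id]
    elif bucket_id in hbt:
        bucket = read_xpoint(bucket_id)      # hot bucket: fast tier
    else:
        bucket = read_flash(bucket_id)       # everything else: flash
    # Python dicts keep insertion order, so the last key is the MRU entry.
    main_buffer[bucket_id] = bucket
    return bucket
```

A device is touched at most once per call, and only when the bucket is in none of the in-memory structures.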

From the above, it is obvious that the two disk access principle 
of Grid File is also preserved in H-Grid. The cost of searching a 
single point in H-Grid is determined by the cost of retrieving the 
directory and bucket pages from the physical storage. 

Thus, for a given search query $Q$, let $x_s \in \{0,1\}$ represent 
whether sub-directory $s$ is stored in the 3DXPoint storage or not, 
and $x_b \in \{0,1\}$ denote whether the bucket $b$ is in 3DXPoint or not 
as well. The cost $C_Q$ of $Q$ is 

\[
\begin{aligned}
C_Q &= x_s R_x + R_f (1 - x_s) + x_b R_x + R_f (1 - x_b) \\
    &= 2 R_f - (R_f - R_x)(x_s + x_b),
\end{aligned}
\]

where $R_f$ and $R_x$ denote the cost of reading a page from the flash 
and 3DXPoint, respectively. The wider the difference in page read 
time between flash and 3DXPoint gets, the higher the performance 
gain of H-Grid becomes. 
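
As a quick numerical check of the cost expression, the sketch below evaluates both forms for every placement of the sub-directory and the bucket. The function names and the page read costs (100 and 10, in arbitrary time units) are assumptions chosen only for illustration.

```python
def search_cost(x_s: int, x_b: int, r_f: float, r_x: float) -> float:
    """Cost of a single point search: one sub-directory read plus one
    bucket read, each served by 3DXPoint (x = 1) or flash (x = 0)."""
    return x_s * r_x + (1 - x_s) * r_f + x_b * r_x + (1 - x_b) * r_f


def search_cost_closed_form(x_s: int, x_b: int, r_f: float, r_x: float) -> float:
    """Equivalent form: 2*R_f - (R_f - R_x) * (x_s + x_b)."""
    return 2 * r_f - (r_f - r_x) * (x_s + x_b)


R_F, R_X = 100.0, 10.0  # hypothetical page read costs

costs = {
    (x_s, x_b): search_cost(x_s, x_b, R_F, R_X)
    for x_s in (0, 1)
    for x_b in (0, 1)
}
```

Each page moved to the faster device saves R_f - R_x, so with these numbers the cost drops from 200 (both on flash) to 110 (one on 3DXPoint) and to 20 (both on 3DXPoint).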
3.4 Insert Point 


Algorithm 4 describes the insertion of a new point to the H-Grid. 
It receives as input a point p and exploits the Search operation to 
acquire the bucket B wherein p has to be inserted. If B is not full, a 
proper record is composed and is added to it (lines 2-12). In case 
B resides in the flash SSD, the hybrid bucket detect operation is 
invoked, testing its eligibility for migration to the 3DXPoint storage. 
A proper dirty flag is set, denoting the bucket’s storage medium. This 
flag is exploited by the write operation. Each bucket B accommo- 
dates a certain number of records. In case B is full, a split operation 
of B is initiated, resulting in the introduction of a new bucket. Suc- 
cessive insertions of new records may cause a sub-directory split 
as well. 


Algorithm 4: Insert(p, S, WS) 
Data: the new entry p to be inserted, the parent sub-directory 
S, the weight of the parent sub-directory WS 
1 B ← Search(p, S, WS); 
2 if B is not full then 
3     insert record (p) to B; 
4     if HBT[B.id] is not NULL then 
5         set the 3DXPoint dirty flag of B; 
6     else 
7         if not HybridBucketDetect(B, S, WS) then 
8             set the flash dirty flag of B; 
9         end 
10     end 
11     update B timestamp; 
12     return 1; 
13 else 
14     split bucket B; 
15     Insert(p, S, WS); 
16 end 


4 PERFORMANCE EVALUATION 
4.1 Methodology and setup 


In this section we present the evaluation of H-Grid using both flash 
and 3DXPoint storage devices. We present the performance benefits 
of H-Grid against flash efficient (GFFM [5]) and traditional (R*-Tree 
[7]) spatial indexes that are unable to exploit diverse storages. We 
also test the special case of H-Grid, presented in Section 3.1, that 
persists all its sub-directories to the 3DXPoint storage. 

All the experiments were performed on a workstation running 
CentOS Linux 7 (Kernel 4.14.12). The workstation is equipped with 
a quad-core Intel Xeon E3-1245 v6 CPU (3.70GHz), 16GB of 
RAM, and a SATA SSD for hosting the operating system. The 
experiments were conducted on an Intel DC P3700 480GB PCI-e 
3.0 SSD (flash) and an Intel Optane 32GB Memory Series device 
(3DXPoint). The latter belongs to the first generation of devices 
utilizing 3DXPoint memory. Table 1 summarizes the performance 
characteristics of the two devices as provided by manufacturers' 
data sheets. 
Table 1: SSD Characteristics 

              | Intel DC P3700 (Flash) | Optane Memory series (3DXPoint) 
Seq. Read     | up to 2700MB/s         | up to 1350MB/s 
Seq. Write    | up to 1100MB/s         | up to 290MB/s 
Random Read   | 450K IOPS              | 240K IOPS 
Random Write  | 75K IOPS               | 65K IOPS 
Latency Read  | 120µs                  | 7µs 
Latency Write | 30µs                   | 18µs 


We use two synthetic and one real dataset for the experiments. 
The synthetic datasets follow Gaussian and Uniform distributions, 
respectively, while the real one contains geographical points 
extracted from OpenStreetMap¹. All experiments were executed using 
the Direct I/O (O_DIRECT) option to bypass the Linux OS caching 
system. We varied the selectivity parameter s (Alg. 1) in the range 
1.0 to 2.5 across the various workloads. In this way, up to 
40% of the sub-directories and up to 20% of the data buckets migrated 
to 3DXPoint. We set the total size of the in-memory buffers for 
every examined index to 8MB. We did not manage to run R*-tree 
on the 3DXPoint using the real dataset due to lack of space. 


4.2 Insert/Search Queries 


We evaluated the performance of H-Grid using six different work- 
loads for each dataset. Regarding the real dataset, the indexes were 
initialized with 500M points. Figure 3a presents the elapsed time for 
10M operations with the specified search and insert ratios. Specifi- 
cally, H-Grid achieves a speedup which ranges from 18.4% to 43% 
in comparison to the execution of GFFM on the flash SSD (baseline). 
H-Grid does not provide adequate results when the buckets for the 
3DXPoint are randomly selected. This fact reveals the efficiency of 
the proposed hot region detection algorithm. The special case of 
H-Grid, which considers placing all sub-directories in the 3DXPoint 
storage, achieves a significant performance gain that ranges from 
34.4% to 56.6%. The acquired results are even better when GFFM 
exclusively utilizes the 3DXPoint SSD as persistent storage (best 
case), providing a speedup reaching 78.9% in comparison to the 
execution on the flash SSD. 

For the synthetic dataset runs, we used 50M points for initialization
and 5M I/S operations for testing. As depicted in Figures 3b and 3c,
there is a remarkable improvement in all experiments involving
read-sensitive workloads. Figure 3b presents the results for
the Gaussian dataset. Specifically, using the 3DXPoint SSD as the sole
storage medium for GFFM, we achieve an improvement ranging
from 49.7% to 77.9% compared with its execution on the flash SSD.
H-Grid achieves a performance gain of up to 35% in comparison to
the GFFM run on the flash SSD. The special case of H-Grid exhibits
even better performance (29.6%-52.7%), as expected. The results
are similar in the test cases that utilize the uniformly distributed
dataset (Fig. 3c). H-Grid is up to 35% faster than the GFFM
execution on the flash SSD, while the special case improves the
result further. The low latency of the 3DXPoint SSD also contributes
significant performance gains for the R*-tree compared with its
flash-based execution.

¹ http://spatialhadoop.cs.umn.edu/datasets.html

Figure 3: Execution times of I/S queries for different workloads (panel (c): Uniform dataset). H-Grid provides better results when searches are the majority.

Figure 4: Execution times of kNN queries for different workloads.


4.3 KNN Queries 


In this section we analyze the performance of kNN queries. We
previously initialized the indexes, using the same datasets as in the
I/S queries (500M points for the real dataset, 50M points for each
synthetic one). Figure 4a depicts the results for the real dataset,
while Figures 4b and 4c correspond to the Gaussian and Uniform test
cases. Regarding the real dataset, H-Grid provides a gain of up to 17%
in the 100NN case, while in the 1000NN case the gain is only 4.7%.
This is due to the large number of bucket reads that the 1000NN query
imposes. The results are better in the smaller synthetic workloads.
In particular, for the Gaussian dataset, the improvement ranges from
12.3% for the 10NN query up to 31% for the 1000NN one. Similarly, in
the experiments with the Uniform dataset, a speedup ranging from
12.4% up to 30% is achieved. H-Grid achieves better results when all
the sub-directories reside in the 3DXPoint (special case). Specifically,
for the real dataset, its execution time improves by 13.6% up to
23.5%.


4.4 Range Queries 


We discuss the performance of range queries next. Specifically, 
we present the elapsed times of 5K requests issued to each one 
of the examined indexes. We posed 1M queries to the previously 
initialized indexes. Figure 5 summarizes the results for all test cases. 
H-Grid improves GFFM on flash SSD (baseline case) up to 28% in 
the real dataset run, while the gain for H-Grid is smaller in the 
runs that use the synthetic data. The efficiency of the proposed
hot region detection algorithm is confirmed once again,
since the random selection of buckets for migration leads to worse
results. Running GFFM solely on the 3DXPoint SSD provides
significant performance improvements, which range from 74.5% to
78%. The speedup for the R*-tree (75%) when it utilizes the 3DXPoint
SSD is also remarkable for the Gaussian and Uniform workloads.

Figure 5: Execution times of range queries for three different
datasets.


5 CONCLUSIONS


In this paper we highlighted the opportunities that new
or upcoming non-volatile memory technologies create for data 
indexing. We studied the performance of spatial indexes exploiting 
3DXPoint NVM as secondary storage and we introduced H-Grid, 
a spatial index for hybrid storage. H-Grid detects hot regions and 
persists them in 3DXPoint storage. 

The experimental results show significant performance improvement
for H-Grid in comparison to GFFM, a flash-based Grid File
variant. Specifically, the gain ranges from 35% up to 43% in
single point retrieval, while the achieved speedup for range and kNN
queries is up to 28% and 32%, respectively. Examining the results
attained by H-Grid, we can infer that tree indexes, like B-trees and
R-trees, can also benefit significantly from storing hot nodes in the
low-latency 3DXPoint. This gain can be higher in workloads that
impose small random I/O.

We thus demonstrated that even small amounts of 3DXPoint in the
secondary storage layer can accelerate spatial query performance
at affordable cost (e.g., a 32GB Optane module costs under 100 USD).
Our plans for future work on H-Grid include a method for tuning
the selectivity parameter s based on the workload's characteristics and
a cooling process for buckets that stay for a long time in the 3DXPoint
storage without being accessed. We also intend to study the
performance characteristics of tree indexes, like B-trees and R-trees, in
non-volatile storage.


ABSTRACT


In addition to sensor heterogeneity, monitoring applications must
handle different temporal data models (e.g., time series, event
sequences). In this paper, we address the problem of discovering
directly actionable high-level knowledge from such data. We model
temporal information through interval-based streams describing
environment states. We propose an approach, called CTD-Miner, to
efficiently discover Complex Temporal Dependencies (CTD) between
state streams. A CTD is modeled similarly to a conjunctive
normal form and describes temporal relations (time delays) between
states. CTD-Miner is robust to the temporal variability of data and uses
a statistical independence test to determine the most appropriate
time lags between states. This test is also used for pruning
during sub-dependency checking. Finally, we validate our approach
on synthetic data and on a case study in a real-world smart
environment using outdoor cameras and real-time video processing.


CCS CONCEPTS 


• Information systems → Data streams; Temporal data; Sensor
networks; Data analytics; Data stream mining.


1 INTRODUCTION


Temporal knowledge discovery is an important task for a grow- 
ing number of application domains where large volumes of time- 
stamped data can be generated. Smart environments are a typical 
example of such contexts. They refer to places or objects equipped 
with a sensor system monitoring one or more physical measures 
through data streams. These make it possible to obtain a temporal
description of the evolution of the environment's characteristics,
which is induced by temporal phenomena occurring within it. In this
context, a knowledge discovery task consists in extracting non-trivial,
expressive and concise patterns describing typical hidden temporal
phenomena.

Figure 1: A smart environment

We describe in Fig. 1 several one-way corridors used by actors of
two types: pedestrians and cyclists. This environment is equipped
with a sensor system composed of door sensors and video cameras
monitoring parts of the corridor. For data produced by video
cameras, advances in image and video processing make it possible to
obtain useful insights: motion detection, counts, object recognition.
In this example, C1, C3, C4 and C5 provide motion detection, C2
provides object recognition and C6 counts moving objects. Each of
these sensors provides a data stream to a monitoring application.
Examples are shown in Fig. 2. Our objective is to extract temporal
knowledge from such configurations.

Legend: MB: Motion Begin; ME: Motion End; O: Opening door;
C: Closing door; H: Human; B: Bicycle

Figure 2: Some raw data streams gathered from the sensor
system depicted in Fig. 1

Temporal descriptions of a sensed environment are richer if
different types of features are used. In the example described above,
the analysis of motion information would yield 3 main
trajectories: "C1 then C3", "C1 then C4" and "C1 then C5". In this
example, C3 and C4 can only be reached by pedestrians and C5
by cyclists. A more valuable insight is given by the heterogeneous
relation "C1 then <Bicycle> then C5". The usage of heterogeneous
descriptions for sensed environments offers the opportunity to
discover temporal knowledge reporting on complex relations between
different features of the environment. This poses a challenging
problem to temporal pattern mining approaches:


How to obtain complex, non-trivial and directly actionable temporal 
knowledge starting from heterogeneous sensor raw data? 


Each sensor may use the most appropriate time model (time
points or intervals) and data model (e.g., numeric time series,
symbolic events or item sets) w.r.t. its physical measure. A simple way to
address heterogeneity is to transform raw sensor data sequences to a
unified general model using a Temporal Abstraction (TA) operation.
In [9], the authors defined TA as "the segmentation and/or aggregation
of series of raw (...) data into a symbolic time-interval series
representation, often at a higher level of abstraction (...), suitable for human
inspection or for data mining". In addition to solving heterogeneity
problems, TA makes it possible to build a high-level pattern
vocabulary that is more suitable for human perception and
interpretation [5]. In this work, we use an interval-based
representation built on states referring to data configurations of
interest for the application domain.
States are defined via predicates on one or more sensors producing
data. A state stream contains the parts of time (intervals) where the
state's predicate is valid. This way, data provided by a sensor system
is transformed into a set of unified interval-based state streams.

Time intervals make it straightforward to take duration into account
in discrete time, contrary to point-based events. Therefore,
using a point-based approach to process interval data induces a loss
of information and expressiveness. As discussed in [1], 13 relations
can exist between two intervals. Existing interval-based qualitative
pattern models, based on all or a subset of Allen's logic, may
suffer from various expressiveness issues such as ambiguity (a pattern
may lead to different temporal relations) or completeness (capability
of expressing all possible relations) [5]. To some extent,
quantitative patterns maintaining temporal information solve
these problems, as temporal relations are explicitly characterized,
which makes it possible to infer Allen's logic. Moreover, time lags
can also be a discriminant factor. In the example described above,
the trajectory "C1 then C2" can be performed by pedestrians and
cyclists. These two types of actors perform the same qualitative
trajectory but with different temporal information: cyclists ("C1 then
C2 after d units of time") are faster than pedestrians ("C1 then C2
after d' units of time"), where d < d'. This is often useful, for
example, to perform forecasting: "if C1 then C2 after d", one can
predict that the following trajectory step is C5 (rule corresponding
to cyclists).

Very few existing approaches directly tackle quantitative interval
patterns in streams or can easily be adapted to this task ([11],
[4]). In [11], the authors introduced a novel form of temporal relations
based on the assessment of interval intersection. The intersection
of a pair of state streams contains the time portions where both states
are active. The length of this intersection is used to statistically
assess the significance of the temporal correlation. In our work,
this model is extended to handle more complex relations involving
multiple states. We want to express relations including disjunctions
and conjunctions involving multiple data streams, like "IF A then
(B or C) after a duration d".

In this work, we first introduce the Complex Temporal Dependency
model (CTD), a quantitative pattern model that extends the
pairwise dependencies proposed in [11]. It is automatically assessed
using a $\chi^2$ test of independence on interval intersection length.
Next, we propose CTD-Miner, an algorithm devised to discover
CTDs efficiently. It includes a novel time lag discovery method,
sub-dependency pruning techniques and dependency merging.
Finally, we conduct experiments using both simulated and real-life
motion sensor data to validate our approach.


2 RELATED WORK 


Sequential and temporal pattern mining is a well studied research 
area designed to discover regularities among temporally ordered 
data. Contributions in this field can be categorized following three 
main criteria: data format (transaction databases or streams/single 
sequences), time models (time points or intervals) and temporal 
description (qualitative or quantitative). 

Most existing approaches consider input data to be in a
transactional format: each transaction is a temporally ordered sequence
associated with an ID. A typical example would be medical data: a
sequence describes the medical history of a particular patient identified
by an ID. The transactional format makes clear separations between
activities (e.g., between patients' medical records) and can be called
"subject-centered". With this data format, the temporal pattern mining
task consists in finding temporal regularities between transactions.
On the other hand, sensor streams (or single sequences) are a
continuous flow of data where no boundary exists between activities and
describe the evolution of physical measures. For example, a motion
sensor reports on motion rather than describing the motion of objects
separately: the same piece of data can be generated by the activity of
one or several actors. Hence, sensor data can be called "measure-centered".
In both cases, existing contributions use a user-given minimum
support referring to the part of the data where a relation stands. The
support assessment can be based on the number of transactions for
transactional data, time windows [7] or the number of items [13].

Most of the contributions dealing with streaming data or single
sequences focus on time points with qualitative patterns reporting
on before/after/co-occurring relations [7]. Qualitative patterns do
not permit time delay discrimination, which is rather useful for
many application domains. While some contributions integrate
temporal constraints (time windows [7], gap constraints [2]), several
approaches tackle quantifying time delays ([6] [13] [14]). Another
category of contributions deals with data represented as intervals.
As discussed in [1], this time model makes it possible to express more
complex relations (13 relation types in Allen's logic). This form of
patterns may suffer from expressiveness limitations such as ambiguity
(a pattern may lead to different relations) or completeness (not
allowing all possible relations to be expressed). We refer to [5] for
a detailed analysis of expressiveness. Besides, qualitative patterns
are sensitive to temporal variability, as a slight variation in
interval endpoints may lead to different qualitative relations.

Quantitative patterns make it possible to deal with these problems as
they include temporal information from which Allen's logic can be
inferred. Among the existing contributions [10] [3] [12], only two
recent approaches are designed for or can be easily adapted to
interval streams [4] [11]. In [4], the authors propose PIVOTMiner,
which uses a geometric approach consisting in the projection of each
interval $[t_{begin}, t_{end})$ into a bi-dimensional plane
(begin, end) where the temporal relation between two intervals is
considered as a vector. This makes it possible to perform a DBSCAN
clustering after an origin transformation stage: to mine relations of
type $A \rightarrow B$, all vectors with A as source and B as target
are moved such that all sources coincide with the space origin.
Cluster centroids provide the time lag information.
While designed for sequence databases, this approach can be easily
adapted to data streams as it is not endpoint sensitive (as in [12],
[3]). Finally, in [11] the authors proposed a novel form of temporal
relations made possible by the usage of intervals. It is based on the
intersection of sets of validity intervals corresponding to predicates
describing the environment states. The authors propose an
algorithm, TEDDY, devised to discover dependencies of type
"$A \rightarrow B$ after $(\alpha, \beta)$ units of time" assessed via
a confidence measure.


3 STATE STREAMS 


3.1 From raw sensor data to state streams 


We define a data stream $D$ as a sequence of time-stamped data
produced by a source $d \in \Delta$, with $\Delta$ the set of data
sources composing a sensor system. Formally, a data sequence $S_d$
produced by $d$ is defined as
$S_d = \langle (t, v) \mid t \in \mathcal{T}, v \in \mathcal{V}_d \rangle$.
$\mathcal{T}$ is an infinite set of discrete time stamps and $t$ can be
either a time point or an interval $t = [t_b, t_b + 1)$.
$\mathcal{V}_d$ is the set of possible values given by $d$.
We assume that a data source cannot produce more than one value
at a time. In the example depicted in Fig. 1, the set of data sources
is $\Delta = \{C1, C2, \ldots, DS1, DS2\}$. The possible values for the
data source C1 are the event labels $\mathcal{V}_{C1} = \{MB, ME\}$,
for C2 all subsets of possible objects
$\mathcal{V}_{C2} = \{O_i \mid O_i \subseteq \{H, B\}\}$, and for C6 the
set of positive integers $\mathcal{V}_{C6} = \mathbb{N}$. Examples of
data streams are provided in Fig. 2. Notice that data sources can be
heterogeneous at different levels: sample rates, data models (e.g.,
time series, labeled events, item sets), time models (time points or
intervals). Our objective is to process the data streams given by
$\Delta$ in order to obtain high-level temporal knowledge.

Our approach to solving the problems introduced by heterogeneity is
to use Temporal Abstraction (TA), devised to transform raw
heterogeneous data into a unified interval-based model using high-level
abstraction providing a "pattern vocabulary" for temporal
knowledge. This technique was often applied to medical records (e.g.,
[9]) or to sensor time series (e.g., [8]). More precisely, the main
idea is to use expert knowledge or automatic techniques (e.g., time
series discretization) to define a batch of high-level states that can
be seen as "environment features" of interest. A state is a particular
environment configuration referring to a non-trivial situation.
Formally speaking, a state $a$ is defined as a predicate
$a : \mathcal{T}, \Delta \rightarrow \mathbb{B}$.
The result of $a$ is a boolean value stating whether the state is
active or inactive at $[t, t + 1) \in \mathcal{T}$ given a set of data
sequences in $\Delta$. Some examples of states:


• motionInC1(t, C1) ::= last(t, C1).v = MB
• increasingOccupency(t, C6) ::= last(t, C6).v − last(t − 1, C6).v > 0
• congestion(t, C1, C6) ::= (last(t, C1).v = MB) ∧ (last(t, C6).v − last(t − 1, C6).v = 0)

Predicates can be simple conditions on the values of a single raw data
sequence, as motionInC1, which reports on simple motion activity (the
last event produced by C1 at the timestamp t is MB, Motion Begin).
States can also report on data trends, as increasingOccupency, or
integrate data from various sensors, as congestion.

Figure 3: Examples of intersection and union (a) and of
$(\alpha, \beta)$-transformation (b)

A state stream is a temporally ordered sequence containing all
time intervals, called active intervals, where its corresponding state
predicate is verified. A state stream $A$ corresponding to state $a$ is
formally defined as
$A = (a(t, \Delta'), \langle [t_{b_i}, t_{e_i}) \rangle)$ such that
$\forall t_{b_i}, t_{e_i} \in \mathcal{T}$,
$t_{b_i} < t_{e_i} < t_{b_{i+1}}$, and
$\forall t \in [t_{b_i}, t_{e_i})$, $a(t, \Delta') = True$, with
$\Delta' \subseteq \Delta$.
The size of a state stream $A$, noted $\#A$, corresponds to the number
of its intervals. The length of a state stream is the sum of the
durations of its active intervals:

$len(A) = \sum_{[t_b, t_e) \in A} (t_e - t_b)$

For example, the state stream corresponding to motionInC1 from
Fig. 2 is motionInC1 = < [1, 4), [13, 14) >, with size(motionInC1) =
2 and len(motionInC1) = 4.
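
The passage from raw events to a state stream can be illustrated with a minimal Python sketch. It is not the authors' implementation: the event list `C1_EVENTS`, the helper names (`last`, `state_stream`, `size`, `length`) and the finite horizon `[0, 20)` are assumptions, chosen so that the output reproduces the motionInC1 example quoted above.

```python
from typing import Callable, List, Optional, Tuple

Interval = Tuple[int, int]  # half-open interval [begin, end)

# Hypothetical raw events (timestamp, label) for camera C1, chosen so that
# the result matches the state stream quoted in the text.
C1_EVENTS = [(1, "MB"), (4, "ME"), (13, "MB"), (14, "ME")]


def last(t: int, events: List[Tuple[int, str]]) -> Optional[str]:
    """Label of the most recent event with timestamp <= t (None if none)."""
    value = None
    for timestamp, label in events:  # events are sorted by timestamp
        if timestamp > t:
            break
        value = label
    return value


def motion_in_c1(t: int) -> bool:
    """State predicate: the last event produced by C1 at time t is MB."""
    return last(t, C1_EVENTS) == "MB"


def state_stream(predicate: Callable[[int], bool],
                 t_start: int, t_end: int) -> List[Interval]:
    """Maximal intervals of [t_start, t_end) on which the predicate holds."""
    intervals: List[Interval] = []
    begin: Optional[int] = None
    for t in range(t_start, t_end):
        active = predicate(t)
        if active and begin is None:
            begin = t                      # an active interval opens
        elif not active and begin is not None:
            intervals.append((begin, t))   # the active interval closes
            begin = None
    if begin is not None:                  # still active at the horizon
        intervals.append((begin, t_end))
    return intervals


def size(stream: List[Interval]) -> int:
    """Number of active intervals."""
    return len(stream)


def length(stream: List[Interval]) -> int:
    """Sum of the durations of the active intervals."""
    return sum(end - begin for begin, end in stream)


motion = state_stream(motion_in_c1, 0, 20)
print(motion, size(motion), length(motion))
```

With these assumed events the script prints `[(1, 4), (13, 14)] 2 4`, in line with size(motionInC1) = 2 and len(motionInC1) = 4.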


3.2 Operations on state streams 


We define several operations on state streams: intersection, union 
and temporal transformation. 

The intersection of two state streams $A$ and $B$, noted $A \cap B$, is
a state stream containing the intervals where both $A$ and $B$ are
active (Fig. 3). Formally,
$A \cap B = (a \wedge b, \langle [t_{b_i}, t_{e_i}) \rangle)$ such that
$\forall t \in [t_{b_i}, t_{e_i})$,
$\exists [t_{b_j}, t_{e_j}) \in A, [t_{b_k}, t_{e_k}) \in B$ such that
$t \in [t_{b_j}, t_{e_j})$ and $t \in [t_{b_k}, t_{e_k})$.
This operation is computed in $O(Max(\#A, \#B))$.

The union of two state streams $A$ and $B$, noted $A \cup B$,
produces a new state stream containing the intervals where $A$ or $B$
are active (Fig. 3). Formally,
$A \cup B = (a \vee b, \langle [t_{b_i}, t_{e_i}) \rangle)$ such that
$\forall t \in [t_{b_i}, t_{e_i})$,
$\exists [t_{b_j}, t_{e_j}) \in A, [t_{b_k}, t_{e_k}) \in B$ such that
$t \in [t_{b_j}, t_{e_j})$ or $t \in [t_{b_k}, t_{e_k})$.
Similarly to intersection, the union of two state
streams has a time complexity of $O(Max(\#A, \#B))$.

A state stream $B$ can also be temporally shifted via an
$(\alpha, \beta)$-transformation. This operation results in a state
stream $B^{(\alpha,\beta)} = \langle [t_{b_i} - \alpha, t_{e_i} - \beta) \mid [t_{b_i}, t_{e_i}) \in B \rangle$.
Hereafter, $\alpha$ is called expansion and $\beta$ reduction. We
describe in Fig. 3 two examples: a (0, 2)-reduction $B^{(0,2)}$ and a
(2, 0)-expansion $B^{(2,0)}$. This temporal transformation
is done in $O(\#B)$.
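
The three operations can be sketched in Python on sorted lists of half-open intervals. This is an illustration under assumptions, not the paper's code: the function names, the sample streams `A` and `B`, and the choice to drop intervals emptied by a shift and to merge intervals that come to overlap are ours. Each function makes a single linear pass over its inputs.

```python
from heapq import merge
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open interval [begin, end)


def _coalesce(sorted_intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals of a begin-sorted sequence."""
    out: List[Interval] = []
    for begin, end in sorted_intervals:
        if out and begin <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], end))
        else:
            out.append((begin, end))
    return out


def intersection(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intervals where both streams are active (two-pointer sweep)."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        begin = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if begin < end:
            out.append((begin, end))
        # advance the stream whose current interval finishes first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def union(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intervals where at least one stream is active."""
    return _coalesce(merge(a, b))  # heapq.merge keeps the inputs sorted


def shift(b: List[Interval], alpha: int, beta: int) -> List[Interval]:
    """(alpha, beta)-transformation: [tb - alpha, te - beta) for each interval."""
    moved = ((begin - alpha, end - beta) for begin, end in b)
    return _coalesce(iv for iv in moved if iv[0] < iv[1])


# Hypothetical streams, for illustration only
A = [(1, 5), (9, 13)]
B = [(4, 8), (12, 14)]

print(intersection(A, B))  # [(4, 5), (12, 13)]
print(union(A, B))         # [(1, 8), (9, 14)]
print(shift(B, 0, 2))      # (0, 2)-reduction: [(4, 6)]
print(shift(B, 2, 0))      # (2, 0)-expansion: [(2, 8), (10, 14)]
```

In the (0, 2)-reduction the interval [12, 14) becomes empty and is dropped; how the original method treats such degenerate intervals is not stated in the text.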


4 COMPLEX TEMPORAL DEPENDENCIES 


This section introduces the Complex Temporal Dependencies (CTD) 
model that aims to describe temporal correlations between multiple 
state streams on the basis of their intersection length. 


4.1 Background 


A pairwise temporal dependency between state streams $A$ and $B$,
noted $A \rightarrow B$, states that $A$ occurs simultaneously with $B$.
We call hereafter $A$ the premise and $B$ the conclusion.
$A \rightarrow B$ is assessed with the intersection length of $A$ and
$B$ via the following confidence measure:

$conf(A \rightarrow B) = \frac{len(A \cap B)}{len(A)}$

Notice that confidence is maximal (= 1) if all active intervals of $A$
are included in active intervals of $B$: $A$ always occurs with $B$. For
example, in Fig. 3, $conf(A \rightarrow B)$ = […].

In order to find relations of interest, and to avoid the use
of a user-given threshold, this confidence measure is statistically
assessed via a Pearson $\chi^2$ test of independence. The independence
hypothesis states that $A$ and $B$ are statistically independent within
a duration $\mathcal{T}$. It relies on the following assumption: if the
active length of $B$ is uniformly distributed in $\mathcal{T}$, there is
no significant correlation between $A$ and $B$. The resulting validity
threshold on confidence is noted $th(len(A), len(B))$¹ and is obtained
for a significance level of 0.05 and 1 degree of freedom [11].
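
As a hedged illustration of how such a threshold can be obtained, the Python sketch below solves the standard Pearson statistic of a 2x2 contingency table (A active or not, B active or not, over `t_obs` time units) for the overlap at which the statistic reaches the critical value 3.841 (1 degree of freedom, significance level 0.05). The function names and the sample lengths are assumptions, and the expression may differ in detail from the one used in [11].

```python
import math

# Chi-square critical value for 1 degree of freedom at significance 0.05
CHI2_CRIT = 3.841


def chi2_2x2(overlap: float, len_a: float, len_b: float, t_obs: float) -> float:
    """Pearson statistic of the 2x2 table built from the active lengths.

    Cells: both active = overlap, only A = len_a - overlap,
    only B = len_b - overlap, none = t_obs - len_a - len_b + overlap.
    """
    num = t_obs * (overlap * t_obs - len_a * len_b) ** 2
    den = len_a * len_b * (t_obs - len_a) * (t_obs - len_b)
    return num / den


def confidence_threshold(len_a: float, len_b: float, t_obs: float,
                         crit: float = CHI2_CRIT) -> float:
    """Smallest confidence len(A n B) / len(A) that rejects independence
    in favour of a positive correlation."""
    spread = math.sqrt(
        crit * len_a * len_b * (t_obs - len_a) * (t_obs - len_b) / t_obs
    )
    return (len_a * len_b + spread) / (t_obs * len_a)


# Hypothetical lengths: A active 20 units, B active 30 units, out of 100
th = confidence_threshold(20, 30, 100)
print(round(th, 4))
```

Under independence the expected confidence would be len(B) / t_obs = 0.3; the threshold lies above that value, and plugging the corresponding overlap back into `chi2_2x2` returns the critical value.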

Simultaneity does not make it possible to express dependencies when
$B$ is time-delayed with respect to $A$. The conclusion stream
can be temporally shifted with an $(\alpha, \beta)$-transformation to
obtain a relation of type $A \rightarrow B^{(\alpha,\beta)}$. The
associated confidence value is then given by:

$conf(A \rightarrow B^{(\alpha,\beta)}) = \frac{len(A \cap B^{(\alpha,\beta)})}{len(A)}$

A dependency $A \rightarrow B^{(\alpha,\beta)}$ means that $B$ starts at
most $\alpha$ time units after $A$ and $B$ finishes at least $\beta$
time units after $A$. For simplicity, we refer hereafter to this
relation with: if $A$ is active then $B^{(\alpha,\beta)}$ is active.
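
The confidence of a time-shifted dependency can be computed directly from the definitions above. The following Python sketch is self-contained and illustrative only: the helper names and the streams `A` and `B` are assumptions, and returning 0.0 for an empty premise is our convention.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open interval [begin, end)


def length(stream: List[Interval]) -> int:
    return sum(end - begin for begin, end in stream)


def intersection(a: List[Interval], b: List[Interval]) -> List[Interval]:
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        begin, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if begin < end:
            out.append((begin, end))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def shift(b: List[Interval], alpha: int, beta: int) -> List[Interval]:
    """B^(alpha, beta): begins move back by alpha, ends by beta."""
    out: List[Interval] = []
    for begin, end in b:
        begin, end = begin - alpha, end - beta
        if begin >= end:
            continue                       # the interval vanished
        if out and begin <= out[-1][1]:    # overlaps the previous interval
            out[-1] = (out[-1][0], max(out[-1][1], end))
        else:
            out.append((begin, end))
    return out


def conf(a: List[Interval], b: List[Interval],
         alpha: int = 0, beta: int = 0) -> float:
    """conf(A -> B^(alpha, beta)) = len(A n B^(alpha, beta)) / len(A)."""
    if not a:
        return 0.0  # convention chosen here for an empty premise
    return length(intersection(a, shift(b, alpha, beta))) / length(a)


# Hypothetical streams: B starts 2 units and ends 3 units after A
A = [(0, 4), (10, 14)]
B = [(2, 7), (12, 17)]

print(conf(A, B))        # 0.5 without any shift
print(conf(A, B, 2, 3))  # 1.0 with the (2, 3)-transformation
```

In this toy configuration the unshifted streams overlap on only half of the active length of A, while the (2, 3)-transformation aligns B exactly with A.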


4.2 Conjunctive and disjunctive relations 


As previously described, pairwise state dependencies are composed
of a premise, a conclusion and an inference operator "→" that is
associated with a confidence measure and describes simultaneity
and succession relations. Our goal is to express dependencies with
multiple states.


3 
In Fig.4 pairwise dependencies are A > B23), conf = A and 


6 
B — C2), conf = =. In this example, C always follows a succes- 
sion "A then B": a dependency A followed by B then C should have a 


¹ $th(l_p, l_c) = \frac{l_p \, l_c + \sqrt{\frac{3.84}{T_{obs}} \, l_p \, l_c \, (T_{obs} - l_c)(T_{obs} - l_p)}}{T_{obs} \, l_p}$

Figure 4: Example of 3 correlated streams


confidence of 1. This example emphasizes that dependencies
between multiple states must be able to express conditional
relations like: state $C$ is correlated with state $B$ if $B$ is
preceded by state $A$ after a duration $d$. This is made possible by
the introduction of the conjunctive operator $\wedge$ that expresses
simultaneity or succession but, contrary to the inference operator, is
not associated with a confidence measure. A conjunction of two state
streams $A$ and $B$, noted $A \wedge B$, corresponds to the
intersection $A \cap B$. This conjunctive stream can be seen as the
stream representation of $A \rightarrow B$ and be used to extend this
dependency. In the example above, we can
extract the following conjunctive dependencies:


An B?9 —, Co) conf =1 (1) 
3 
A= B®) a Co) conf = (2) 


These two dependencies are to be interpreted differently: (1) If state 
A and B®) are active then C? is active with a confidence of 
1; (2) If A is active then B®3) and C5) are simultaneously active 


3 
with a confidence of —. 


A more expressive form of dependencies is made possible by
the introduction of disjunctive relations, noted $\vee$. Let us
consider the temporal configuration of states described in Fig. 5.
Using conjunctive relations allows us to obtain the following
dependencies: $A \wedge B \rightarrow C \wedge D$,
$A \wedge B \rightarrow E$, $A \wedge B \rightarrow F \wedge G \wedge H$,
$A \wedge B \rightarrow F \wedge G \wedge I$.
This configuration of states can be concisely described with a unique
dependency using disjunctive relations stating:


IF A AND B are active 
THEN (C AND D are active) OR 
(E is active) OR 
(F and G and (H OR I) are active) 


The corresponding dependency is noted as follows:

$A \wedge B^{(\alpha_b,\beta_b)} \rightarrow (C^{(\alpha_c,\beta_c)} \wedge D^{(\alpha_d,\beta_d)}) \vee E^{(\alpha_e,\beta_e)} \vee (F^{(\alpha_f,\beta_f)} \wedge G^{(\alpha_g,\beta_g)} \wedge (H^{(\alpha_h,\beta_h)} \vee I^{(\alpha_i,\beta_i)}))$

Figure 5: A temporal configuration of states. A is followed
by B, B is followed by C or E or F, etc.

Notice that the premise and the conclusion of this dependency
are both state streams. It suffices to compute, for each dependency
part, conjunctions (stream intersection), then disjunctions (stream
union) to obtain a single representative state stream. This allows
us to compute the confidence value following the same principle
as for pairwise dependencies.
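
This reduction to a single representative stream on each side can be sketched in Python as follows. The example evaluates a dependency of the form "A ∧ B → (C ∧ D) ∨ E"; all stream contents are hypothetical, the helper names (`conj`, `disj`, `conf`) are ours, and the streams are assumed to be already expressed with respect to the reference stream A (only B is shifted explicitly).

```python
from functools import reduce
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open interval [begin, end)


def length(stream: List[Interval]) -> int:
    return sum(end - begin for begin, end in stream)


def intersection(a: List[Interval], b: List[Interval]) -> List[Interval]:
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        begin, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if begin < end:
            out.append((begin, end))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def union(a: List[Interval], b: List[Interval]) -> List[Interval]:
    out: List[Interval] = []
    for begin, end in sorted(a + b):   # sorting keeps the sketch short
        if out and begin <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], end))
        else:
            out.append((begin, end))
    return out


def conj(*streams: List[Interval]) -> List[Interval]:
    """Conjunction of several streams = successive intersections."""
    return reduce(intersection, streams)


def disj(*streams: List[Interval]) -> List[Interval]:
    """Disjunction of several streams = successive unions."""
    return reduce(union, streams)


def conf(premise: List[Interval], conclusion: List[Interval]) -> float:
    return length(intersection(premise, conclusion)) / length(premise)


# Hypothetical streams; B is shifted by a (2, 2)-transformation
A = [(0, 10)]
B_shifted = [(b - 2, e - 2) for b, e in [(2, 12)]]
C, D, E = [(0, 3)], [(1, 5)], [(6, 8)]

premise = conj(A, B_shifted)       # A AND B^(2,2)
conclusion = disj(conj(C, D), E)   # (C AND D) OR E

print(premise, conclusion, conf(premise, conclusion))
```

Here the premise is `[(0, 10)]`, the conclusion is `[(1, 3), (6, 8)]`, and the confidence is 4 / 10 = 0.4.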

Complex Temporal Dependencies (CTD) are defined as depen- 
dencies including conjunctive and disjunctive temporal relations: 


Definition 4.1. Complex Temporal Dependency.
Let $S = \{A_1, A_2, \ldots, A_n\}$ be a set of state streams. A Complex
Temporal Dependency (CTD) over $S$ is defined as:

$D_0 \wedge D_1 \wedge \ldots \wedge D_k \rightarrow D_{k+1} \wedge \ldots \wedge D_{k+m}$

with $D_i$ a state stream resulting from disjunctive and conjunctive
operations using a subset $S_i \subseteq S$ such that
$\forall i \neq j$, $S_i \cap S_j = \emptyset$.
Temporal transformations are defined with respect to a stream
from $D_0$ having a $(0,0)$-temporal transformation.

4.3 Sub-dependencies and correspondence 
relationship 


Conjunctive relations make it possible to obtain dependencies
between multiple states reporting on "large" significant temporal
correlations. As a consequence, the same temporal phenomenon
involving n states can be described entirely with a single
dependency or partially with multiple smaller dependencies, called
sub-dependencies. We describe in Fig. 6 three state streams A, B and C
producing the following dependencies:

$A \rightarrow B^{(2,2)}$, conf = 0.5   (3)
$B \rightarrow C^{(2,2)}$, conf = 1   (4)
$A \rightarrow C^{(4,4)}$, conf = 1   (5)
$A \wedge B^{(2,2)} \rightarrow C^{(4,4)}$, conf = 1   (6)

In this example, dependencies (3) and (4) are sub-dependencies
of (6): the intervals involved in (3) and (4) are included in (6). On
the other hand, dependency (5) is not a sub-dependency of (6): half of
the intervals of A and C are not involved in (6). The ability to detect
sub-dependencies is a key feature in a dependency discovery
process that can be compared to closure-checking for support-based
algorithms. In addition to reducing the amount of redundant
information (pattern flooding), the sub-dependency property can be
useful to define pruning criteria to accelerate a discovery process.
Sub-dependency checking can be done by studying the intersection of
conjunctions via the following correspondence relationship:


Definition 4.2. Correspondence relationship
Let $A$ and $B$ be two state streams. $A$ and $B$ are corresponding if

$conf(A \rightarrow B) > 1 - \epsilon$ and $conf(B \rightarrow A) > 1 - \epsilon$

where $\epsilon$ is a relaxation parameter such that $\epsilon \ll 1$.

Figure 6: Dependencies $A \rightarrow B^{(2,2)}$ and
$B \rightarrow C^{(2,2)}$ are included in
$A \wedge B^{(2,2)} \rightarrow C^{(4,4)}$; $A \rightarrow C^{(4,4)}$ is not.


Correspondence between $A$ and $B$ makes it possible to assess whether
two state streams are exclusively co-occurring: $A$ occurs "almost"
always with $B$ and conversely. The correspondence checking is referred
to as a boolean function $Corr : A, B, \epsilon \rightarrow \mathbb{B}$.
The $\epsilon$ parameter allows an error rate in the correspondence
checking to tackle noisy or temporally variable relations. This
parameter can be defined with respect to the statistical threshold on
the confidence measure. It corresponds to the minimum intersection
length that can be considered as statistically significant. The loss
of a smaller amount w.r.t. the maximal confidence value (= 1) can be
considered as non-significant.
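
A minimal Python sketch of the correspondence check follows. The streams are hypothetical and already expressed with respect to A (`B22` and `C44` stand for shifted versions of B and C); they are chosen so that conf(A → B22) = 0.5 and conf(A → C44) = 1. The default `eps=0.05` is an arbitrary illustrative value, not one prescribed by the text.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open interval [begin, end)


def length(stream: List[Interval]) -> int:
    return sum(end - begin for begin, end in stream)


def intersection(a: List[Interval], b: List[Interval]) -> List[Interval]:
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        begin, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if begin < end:
            out.append((begin, end))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def conf(a: List[Interval], b: List[Interval]) -> float:
    """conf(A -> B) = len(A n B) / len(A); 0.0 for an empty premise."""
    return length(intersection(a, b)) / length(a) if a else 0.0


def corr(a: List[Interval], b: List[Interval], eps: float = 0.05) -> bool:
    """True when A and B are 'almost' exclusively co-occurring."""
    return conf(a, b) > 1 - eps and conf(b, a) > 1 - eps


# Hypothetical streams, already shifted with respect to A
A = [(0, 2), (10, 12)]
B22 = [(0, 2)]             # stands for B^(2,2)
C44 = [(0, 2), (10, 12)]   # stands for C^(4,4)

a_b = intersection(A, B22)        # A AND B^(2,2)
a_c = intersection(A, C44)        # A AND C^(4,4)
a_b_c = intersection(a_b, C44)    # A AND B^(2,2) AND C^(4,4)

print(corr(a_b, a_b_c))  # True: both confidences equal 1
print(corr(a_c, a_b_c))  # False: conf(a_c -> a_b_c) is 0.5
```

The asymmetric case shows why both directions are tested: the larger conjunction is fully covered by A ∧ C44, but not the other way around.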

The sub-dependency checking between two dependencies $D_1$
and $D_2$ amounts to evaluating whether the representative conjunctions
of both dependencies are correspondent. In the example above,
$A \wedge B^{(2,2)}$ and $A \wedge B^{(2,2)} \wedge C^{(4,4)}$ are
correspondent with maximal confidence values:

$conf(A \wedge B^{(2,2)} \rightarrow A \wedge B^{(2,2)} \wedge C^{(4,4)}) = 1$
$conf(A \wedge B^{(2,2)} \wedge C^{(4,4)} \rightarrow A \wedge B^{(2,2)}) = 1$

On the other hand, $A \wedge C^{(4,4)}$ and
$A \wedge B^{(2,2)} \wedge C^{(4,4)}$ are not correspondent:

$conf(A \wedge C^{(4,4)} \rightarrow A \wedge B^{(2,2)} \wedge C^{(4,4)}) = 0.5$
$conf(A \wedge B^{(2,2)} \wedge C^{(4,4)} \rightarrow A \wedge C^{(4,4)}) = 1$

In these examples, the temporal reference stream is $A$ (all temporal
transformations are expressed w.r.t. the intervals of $A$) for both
dependencies. Otherwise, a temporal transformation and an
intersection are applied. For example, a (2, 2)-transformation is
applied to $B \wedge C^{(2,2)}$ as $B$ is shifted with a
(2, 2)-transformation in $A \wedge B^{(2,2)} \rightarrow C^{(4,4)}$.
The sub-dependency checking is done w.r.t.
the intersection of this transformed stream and $A$ (in this case:
$A \wedge B^{(2,2)} \wedge C^{(4,4)}$ and
$A \cap (B \wedge C^{(2,2)})^{(2,2)}$).


Definition 4.3. Sub-dependency
Let $S = \{A_1, A_2, \ldots, A_n\}$ be a set of state streams, $R_1$ a
CTD with streams in $S_1 \subseteq S$ and $R_2$ a CTD with streams in
$S_2 \subseteq S$. Let $A_{R_1}$ and $A_{R_2}$ be respectively the
temporal references of $R_1$ and $R_2$, where $A_{R_2}$ has an
$(\alpha, \beta)$-transformation in $R_1$. $R_2$ is a sub-dependency of
$R_1$ if $S_2 \subseteq S_1$ and:

$Corr(R_1, A_{R_1} \cap R_2^{(\alpha,\beta)}, \epsilon) = True$


5 PROBLEM DEFINITION 


The proposed dependency model defines a search space that can
be characterized in spatial and temporal dimensions. The temporal
dimension contains all possible temporal transformations
$(\alpha, \beta)$. Its total size for a discrete observation duration
$\mathcal{T}$ is $|\mathcal{T}|^2$ as $\alpha \in \mathcal{T}$ and
$\beta \in \mathcal{T}$. The spatial dimension includes all combinations
between state streams and dependency operators (inference, conjunctive
and disjunctive relations) in addition to the temporal dimension
for all streams. This defines a very large search space even for a
small set of state streams. If only conjunctive relations are taken
into account, the search space size is given by:

$\sum_{k=2}^{n} \frac{n!}{(n-k)!} \times (k-1) \times |\mathcal{T}|^{2k}$


where n is the number of input streams. The number of conjunctive
relations without temporal transformations is given by the number
of arrangements of 2 to n elements from a set of n state streams.
The factor (k - 1) stands for the possible positions of the inference
operator within a dependency. For example, with A, B and C, one can obtain
$A \wedge B \rightarrow C$ or $A \rightarrow B \wedge C$, which are two different dependencies.
Moreover, for each state in a conjunctive dependency, there are
$|T|^2$ possible streams to be taken into account. This search space
size shows that a naive exploration is not affordable.
Besides, this number is a lower bound of the total search space size,
as it does not take into account the possible disjunctive relations.
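
To give a sense of scale, the bound above can be evaluated numerically. The following is a minimal sketch, assuming the expression reads as the sum over k from 2 to n of the number of arrangements n!/(n-k)!, times (k - 1), times $|T|^2$ raised to the power k; the function name and parameters are illustrative and are not part of CTD-Miner.

```python
from math import perm


def conjunctive_search_space(n: int, t: int) -> int:
    """Lower bound on the number of candidate conjunctive dependencies.

    n: number of input state streams.
    t: number of discrete time units in the observation duration.
    Assumed reading of the bound: sum over k of A(n, k) * (k - 1) * (t^2)^k.
    """
    return sum(
        perm(n, k)          # arrangements of k streams among n
        * (k - 1)           # positions of the inference operator
        * (t ** 2) ** k     # t^2 candidate transformations per state
        for k in range(2, n + 1)
    )


if __name__ == "__main__":
    # Even three streams over ten time units give millions of candidates.
    print(conjunctive_search_space(3, 10))
```

Under this reading, three streams observed over only ten time units already yield 12,060,000 candidates, which illustrates why the exploration has to be constrained.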

In order to make CTD discovery feasible, we assume that it is
not useful to look for dependencies with large time gaps between
states (similar to a window size constraint in [7]). Therefore, we
limit the allowed time lag to an interval $\Delta = [min, max]$ such that
$\alpha, \beta \in \Delta$, defining a quadratic search space of size $|\Delta|^2$ for each
stream. In addition to this temporal constraint, we limit our search
space to dependencies of the following form:


$$S_0 \wedge S_1 \wedge S_2 \ldots \rightarrow S_k$$


We consider that such dependencies provide sufficient insight to
describe temporal phenomena. For example, for a state succession
A, B then C, obtaining $A \rightarrow B$ with confidence $conf_1$ and $A \wedge B \rightarrow C$
with confidence $conf_2$ characterizes this temporal phenomenon, with each
of its steps associated with a corresponding confidence measure. This kind of
characterization can be represented as a directed acyclic graph:


A --conf_1--> B --conf_2--> C


In this example, node A corresponds to state stream A, node B to the
conjunction $A \wedge B$ and node C to $A \wedge B \wedge C$. Each edge corresponds to a
confidence measure: $conf_1 = \mathit{conf}(A \rightarrow B)$ and $conf_2 = \mathit{conf}(A \wedge B \rightarrow C)$.
This yields a condensed representation of
a given temporal phenomenon, including temporal transformations.

Moreover, we assume that statistically valid disjunctive relations
can be obtained only from statistically valid conjunctions. In other
words, if A is temporally correlated with $B \vee C$, A is correlated with
B and A is correlated with C. For example, in Fig. 7a one can notice
that the conjunctive relation $A \rightarrow B$ is common to the three temporal
relations. The idea is to factorize the conjunctive relations shared by
several dependencies using the correspondence relationship.
The resulting disjunctive dependencies can be represented as
a tree. For the former example, the corresponding tree representation
is depicted in Fig. 7b. Notice that all the information provided
by the three conjunctive relations is kept by this disjunctive tree
representation.

Figure 7: Three conjunctive temporal relations (a) and the corresponding
disjunctive form (b)

Therefore, this paper tackles the following problem:
Given a set S of state streams and a temporal constraint on time
lags $\Delta = [min, max]$, extract all conjunctive Complex Temporal
Dependencies of the form $S_0 \wedge S_1 \wedge S_2 \ldots \rightarrow S_k$ and build disjunctive
tree representations of the corresponding temporal phenomena.


6 CTD-MINER 


CTD-Miner is based on an incremental approach that consists of
building dependencies with i + 1 conjunctions from previously
computed dependencies with i conjunctions (Algorithm 1).


6.1 CTD incremental construction 


The incremental construction of conjunctive relations starts by
considering all streams in S as premise candidates (line 2). In the
main loop (lines 3 to 21), each pair of premise candidate p and stream
$s \in S$ is tested via a time lag discovery algorithm (line 13). Notice
that a state label is allowed to appear only once in a dependency,
via the non-cyclic condition in line 12 (the conclusion state label
must not appear in the premise). The results given by the significant time
lag discovery are used to extend p in ExtendDependency (line 16),
which creates a new premise candidate $p \wedge r$. In order to be able to
reconstruct the graph representation, the extension of a dependency
stores previous confidence values and temporal transformations in
addition to the conjunctive streams. If no results are given by the time
lag discovery algorithm, p is added to the results set: p is no longer
extendable. New premises at a given iteration are considered as
premise candidates for the next iteration (line 21). The loop ends
when no premise candidates are left. The non-cyclic condition (line
12) guarantees the termination of the main loop.

The construction of disjunctive relations constitutes the final step of
CTD-Miner. The results obtained after the main loop are composed
of conjunctive relations. We recall that the conjunctions resulting from
each extension step are stored. Disjunctive dependencies are built
using the correspondence relationship: if two dependencies $D_1$ and
$D_2$ share their first k conjunctive relation labels and the resulting
conjunctions are correspondent, they are merged. Finally, the result
set R is returned (line 24).
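
The main loop described above can be sketched as follows. This is a simplified illustration of the baseline only, not the authors' implementation: dependencies are reduced to tuples of labels whose last element is the conclusion, confidence values and temporal transformations are omitted, and the time lag discovery is replaced by a hypothetical `discover` callback that simply answers whether a premise can be extended with a given stream.

```python
from typing import Callable, Iterable

# A dependency S0 ^ S1 ^ ... -> Sk is modelled as the tuple (S0, S1, ..., Sk).
Premise = tuple[str, ...]


def baseline_ctd_miner(
    streams: Iterable[str],
    discover: Callable[[Premise, str], bool],
) -> list[Premise]:
    """Incremental construction without pruning, merging or disjunctions."""
    labels = list(streams)
    premises: list[Premise] = [(s,) for s in labels]  # every stream is a candidate
    results: list[Premise] = []
    while premises:
        new_premises: list[Premise] = []
        for p in premises:
            extended = False
            for s in labels:
                if s in p:  # non-cyclic condition: a label appears only once
                    continue
                if discover(p, s):  # stand-in for the time lag discovery
                    extended = True
                    new_premises.append(p + (s,))
            if len(p) > 1 and not extended:
                results.append(p)  # p is no longer extendable
        premises = new_premises
    return results


if __name__ == "__main__":
    # Toy oracle (hypothetical): B follows A, and C follows B.
    follows = {("A", "B"), ("B", "C")}
    print(baseline_ctd_miner("ABC", lambda p, s: (p[-1], s) in follows))
```

With the toy oracle, the result contains both ("B", "C") and ("A", "B", "C"): the shorter dependency is already described by the longer one, which is the kind of redundancy that sub-dependency pruning targets.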


6.2 Correspondence relationship based pruning

The above steps constitute the baseline version of CTD-Miner, which
implements the incremental construction of conjunctive relations.
The resulting dependencies may contain sub-dependencies, which
are considered redundant information. For example, let us
consider a temporal phenomenon, called A, made of n temporally
consecutive states $A_1 \rightarrow A_2 \ldots \rightarrow A_n$. The results set will contain
dependencies describing all or a subdivision of this phenomenon,
Algorithm 1: CTD-Miner
Input: S : a set of state streams
       T : observation duration
       Δ = [min, max] : temporal constraint on time lags
Output: R : set of CTDs
1  R ← ∅
2  premises ← S
3  while |premises| > 0 do
4      newPremises ← ∅
5      for p ∈ premises do
6          extended ← False
7          process ← True
8          if |p| > 1 then
9              process ← PrePruning(p, newPremises ∪ R)
10         if process then
11             for s ∈ S do
12                 if s.label ∉ p.labels then
13                     results ← TimeLagDiscovery(p, s, Δ, T)
14                     for r ∈ results do
15                         extended ← True
16                         ext ← ExtendDependency(p, r)
17                         Add ext to newPremises
18             if |p| > 1 and not extended then
19                 Add p to R
20     newPremises ← MergeDependencies(newPremises)
21     premises ← newPremises
22 R ← PostProcessing(R)
23 R ← BuildDisjunctions(R)
24 return R


$A_{n-1} \rightarrow A_n$, $A_{n-2} \wedge A_{n-1} \rightarrow A_n$, ..., $A_1 \wedge A_2 \ldots \wedge A_{n-1} \rightarrow A_n$.
Besides, for each of the previous dependencies, the order of the
conjunctive relations can differ while the temporal information is maintained:
for example, $A_1 \wedge A_2 \rightarrow A_3$ and $A_2 \wedge A_1 \rightarrow A_3$, each with the
appropriate temporal transformations, correspond
to the same temporal succession of states. Therefore, the total number
of resulting conjunctive dependencies describing this n-state
temporal phenomenon is given by:


$$|R|_A = \sum_{k=2}^{n} (k-1)! = \sum_{k=1}^{n-1} k!$$
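
The growth of this count can be checked numerically. The sketch below assumes the count reads as the sum of (k - 1)! for k from 2 to n, i.e. one term per dependency length, counting the orderings of the k - 1 premise states; the function name is illustrative.

```python
from math import factorial


def redundant_dependency_count(n: int) -> int:
    """Number of conjunctive dependencies describing one n-state phenomenon.

    Assumed reading: sum_{k=2}^{n} (k-1)!, which equals sum_{k=1}^{n-1} k!.
    """
    return sum(factorial(k) for k in range(1, n))


if __name__ == "__main__":
    for n in (4, 10):
        print(n, redundant_dependency_count(n))
```

Under this reading, a 10-state trajectory such as the one used in the experiments is described by 409,113 redundant conjunctive dependencies, which motivates the pruning steps that follow.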


In order to keep the number of solutions moderate, CTD-Miner
performs sub-dependency checking at two levels. First, as a
pre-pruning step (line 9), the PrePruning function checks whether
a premise p is a sub-dependency of a dependency in
newPremises or R. If p is a sub-dependency of another dependency, it is pruned:
PrePruning returns false and p is neither processed nor added to the
result set. Second, to ensure that no sub-dependencies are left in
the final results, a post-processing step (line 22) is performed.
Another way to benefit from the correspondence relationship
is to merge dependencies (line 20). This operation consists
of extending dependencies with other dependencies rather than with
a single stream. For example, $A \rightarrow B^{(\alpha_1,\beta_1)}$ and $B \rightarrow C^{(\alpha_2,\beta_2)}$ can be
merged if $A \wedge B^{(\alpha_1,\beta_1)}$ and $A \wedge (B \wedge C^{(\alpha_2,\beta_2)})^{(\alpha_1,\beta_1)}$ are correspondent.
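
The post-processing idea can be sketched as a filter over the result set. This is only an illustration under simplifying assumptions: dependencies are tuples of labels, the inclusion test of Definition 4.3 is applied to their label sets, and the correspondence check is delegated to a hypothetical `corr` predicate supplied by the caller; the actual PrePruning and PostProcessing functions operate on interval streams.

```python
from typing import Callable, Sequence

Dependency = tuple[str, ...]


def prune_sub_dependencies(
    deps: Sequence[Dependency],
    corr: Callable[[Dependency, Dependency], bool],
) -> list[Dependency]:
    """Keep only dependencies that are not sub-dependencies of a kept one."""
    kept: list[Dependency] = []
    # Longest first, so each dependency is compared with those that may contain it.
    for dep in sorted(deps, key=len, reverse=True):
        covered = any(
            set(dep) <= set(other) and corr(other, dep)  # inclusion + correspondence
            for other in kept
        )
        if not covered:
            kept.append(dep)
    return kept


if __name__ == "__main__":
    results = [("B", "C"), ("A", "B", "C"), ("B", "A", "C")]
    # Toy predicate (hypothetical): every included dependency is correspondent.
    print(prune_sub_dependencies(results, lambda big, small: True))
```

Processing the longest dependencies first also collapses reorderings of the same conjunction into a single representative, provided the predicate reports them as correspondent.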


6.3 Time lag discovery 


The discovery of quantitative temporal correlations between two
state streams is a core operation in CTD-Miner. This operation
is executed very frequently during the discovery process. As
a consequence, its time complexity significantly influences the
performance of CTD-Miner.

In order to discover such correlations, we designed the ITLD
(Interval Time Lag Discovery) heuristic algorithm, which we describe
briefly in this section. It performs a significant time lag discovery
that is linear w.r.t. $\Delta$. The main idea of this algorithm is to detect
statistically significant confidence gains and losses. More precisely, given a pair
of state streams A and B, for each $\alpha \in \Delta$, ITLD computes the
following confidence variations:


$$gain = \mathit{conf}(A \rightarrow B^{(\alpha+1,\alpha)}) - \mathit{conf}(A \rightarrow B^{(\alpha,\alpha)})$$
$$loss = \mathit{conf}(A \rightarrow B^{(\alpha,\alpha)}) - \mathit{conf}(A \rightarrow B^{(\alpha,\alpha+1)})$$


The confidence gains and losses are calculated w.r.t. a non-deformed
conclusion stream, i.e., with $(\alpha, \alpha)$-transformations. This
guarantees that intervals of the conclusion B cannot be merged
or canceled by a large deformation induced by temporal
transformations, which would constitute a loss of transition information.
These confidence variations are qualified as elementary, as they are
obtained by adding one unit to either the expansion (for gains) or the
reduction (for losses). As a consequence, the gained/lost conclusion
length corresponds to the number of intervals of B. The statistical
assessment of these elementary confidence variations is done with
the validity threshold $th(len(A \wedge B^{(\alpha,\alpha)}), \#B)$, as it aims to evaluate
whether the confidence variation is statistically independent of
the gained/lost conclusion length. Fig. 8 shows the entire
search space (a) and the corresponding confidence gains, losses and
thresholds.

Figure 8: Temporal search space (a) and resulting confidence
variations for a pair of dependencies with two temporal
relations (13, 13) and (4, 4). $\Delta = [0, 25]$
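
To make the notion of a confidence scan over $\Delta$ concrete, the sketch below performs a brute-force search over pure shifts. It is not ITLD: it ignores expansions, reductions and the statistical threshold. It rests on two assumptions that are not taken from this section: a stream is modelled as the set of discrete time points at which the state holds, and the confidence of A → B is taken to be the fraction of A's active time that is also active in B once B is shifted back by the candidate lag.

```python
def confidence(a: set[int], b: set[int]) -> float:
    """Assumed confidence: share of A's active time points also active in B."""
    return len(a & b) / len(a) if a else 0.0


def best_shift(a: set[int], b: set[int], lags: range) -> tuple[int, float]:
    """Brute-force scan of pure shifts; returns (lag, confidence) of the best one."""
    scores = {lag: confidence(a, {t - lag for t in b}) for lag in lags}
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    a = {0, 1, 2, 10, 11}    # toy stream A
    b = {4, 5, 6, 14, 15}    # toy stream B: A delayed by 4 time units
    print(best_shift(a, b, range(0, 6)))
```

On the toy streams, the scan recovers the delay of 4 time units with a confidence of 1.0. ITLD aims at the same kind of answer while evaluating only a linear number of candidate transformations.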


7 EXPERIMENTS 


This section presents the experimental results that illustrate the
efficiency of CTD-Miner. We evaluated our approach using
synthetic data provided by a motion simulation tool that makes it
possible to obtain data corresponding to different scenarios. We
also study the behaviour of CTD-Miner on a real-world motion
data set generated by a sensor system composed of outdoor video
cameras and using real-time video processing. Algorithms were
implemented in Python² and tested on a 2.1 GHz Core i7 with 8 GB
of memory running Windows 10.


7.1 Simulated Data 


We developed a motion simulation tool² in order to obtain data
sets from custom scenarios (trajectories, activity density, speed of
moving objects, temporal variability). This controlled testbed
generates data sets with a known ground truth, which makes it possible to evaluate
the accuracy of discovered temporal dependencies in a multitude
of configurations.

Table 1 describes the datasets obtained from the simulation tool.
We defined a linear trajectory with 10 equidistant sensors and ran
11 simulations, varying the number of occurrences for the same
duration $T = 10000$ and object speed (cf. Table 1). Each dataset
contains 10 streams and the typical time lag between successive
states is $\approx (4, 4)$. The number of intervals increases with the number
of event occurrences for sparse data (100 to 5000 occurrences)
and decreases for high-density event occurrences due to interval
overlap. The active length of the streams always increases with the
number of event occurrences, with a density ($len(stream)/T_{obs}$) ranging
from 1.7% to 78.5%. Note that the expected result from these
datasets is the trajectory description including 10 states with the
corresponding $(\alpha, \beta)$-transformations (represented as a 10-node
graph).
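
The density measure used in Table 1 can be illustrated as follows. This is a minimal sketch, assuming a stream is given as a list of non-overlapping half-open intervals (start, end) expressed in the same time unit as the observation duration; the function name is illustrative.

```python
def stream_density(intervals: list[tuple[int, int]], t_obs: int) -> float:
    """Active length of a stream divided by the observation duration."""
    # Assumes the intervals are non-overlapping and half-open: [start, end).
    active_length = sum(end - start for start, end in intervals)
    return active_length / t_obs


if __name__ == "__main__":
    # Two intervals covering 170 time units out of 10000.
    print(f"{stream_density([(0, 50), (100, 220)], 10000):.1%}")
```

In this toy case the stream is active for 170 of 10000 time units, i.e. a density of 1.7%, the same order as the sparsest dataset of Table 1.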

Hereafter, we evaluate our approach with respect to the size of $\Delta$, the
density of stream intervals and scalability. We also examine the
impact of sub-dependency pruning, dependency merging and the
time lag discovery algorithm on CTD-Miner's efficiency. To this end,
we consider the following CTD-Miner configurations: Baseline:
the simple incremental construction of CTDs as described
in Section 6.1; Pruning: line 20 is removed from Algorithm 1 and
only the sub-dependency pruning is performed; CTD-Miner:
CTD-Miner as described in Algorithm 1. ITLD was used for the
three previous configurations. The fourth CTD-Miner configuration is
CTD-Miner-TEDDY, which uses TEDDY [11] instead of ITLD. For the
following tests, we limited the execution of the CTD discovery
algorithms to 20 minutes and used the statistical threshold for the
correspondence relationship.


7.1.1 Influence of the size of $\Delta$. Fig. 9 reports the running time and
number of conjunctive relations (before disjunctions construction)
for the 1000 occurrences dataset when $\Delta = [0, max]$ varies.


² https://github.com/AElOuassouli/Quantitaive-Interval-Stream-Mining
Figure 9: Running time and number of results of CTD-Miner,
Pruning, Baseline and CTD-Miner-TEDDY w.r.t. max


CTD-Miner-TEDDY and Baseline reached the 20-minute running
time limit for max = 10 and max = 20, respectively. In both
cases, the number of premise candidates, and by extension the
number of time lag discovery executions, increases significantly.
This can be seen in the number of returned conjunctive
relations. TEDDY outputs a non-negligible number of false positives
w.r.t. the simulation scenario. For example, with $\Delta = [0, 1]$ TEDDY
obtained 7 dependencies when none is expected. For the Baseline, the
number of candidate sub-dependencies increases significantly, as the number
of returned conjunctive dependencies shows.

CTD-Miner and Pruning were able to complete the
discovery process for all values of $max \in [0, 100]$, with an advantage
for CTD-Miner. The results for Pruning correspond to dependencies
with time lags in a given $\Delta = [0, max]$, while CTD-Miner returns a
unique dependency corresponding to the expected 10-conjunction
dependency for max > 5 (except for 10 and 25, where $\Delta$ did not
make it possible to fully capture the intersection for a temporal relation with
$\alpha = max + 1$ and $\beta = max + 1$, so that the correspondence
relationship was not verified). Note that Pruning obtains
the same result for max > 35. Thus, thanks to the sub-dependency
pruning and the dependency merging, CTD-Miner scales well with
respect to $\Delta$.


7.1.2 Influence of density of streams. As described in Table 1, a higher
activity density (number of occurrences) generally comes with an
increase in active length and in the number of intervals. Therefore, a complex
temporal dependency discovery algorithm must be capable of scaling
with respect to the number of intervals and be able to detect
temporal phenomena precisely in both sparse and dense state streams.

Fig. 10 reports the empirical study of CTD-Miner with respect
to stream density. The constant number of conjunctive results
given by CTD-Miner with ITLD (CTD-Miner, Pruning and Baseline)
for each value of max shows that ITLD is robust to the density of
streams and allows CTD-Miner to scale well with respect to the number
of intervals for these datasets. This is not the case for
CTD-Miner-TEDDY, due to its time lag discovery algorithm (the 20-minute limit
was reached for 2000 occurrences with max = 5).


7.1.3 Scalability. One important aspect of CTD discovery is
the ability to process large numbers of streams. We generated
a data set of 90 streams following the same principle as the data sets
described in Table 1, with $T = 10000$ time units, 1000 occurrences
and a time lag of $\approx (6, 6)$ between successive states. Fig. 11 shows
that CTD-Miner was able to process 90 streams (containing 63990
Table 1: 11 simulated data sets with an increasing number of occurrences (Occ). #Int: average number of intervals per stream, Den:
average density (% of $T_{obs}$)


Occ | #Int | Den
100 | 95 | 1.7
500 | 452 | 8.7
1000 | 798 | 16.4
2000 | 1275 | 30.3
3000 | 1509 | 41.5
4000 | 1606 | 50.5
5000 | 1616 | 58.6
6000 | 1570 | 64.7
7000 | 1493 | 69.9
8000 | 1386 | 74.6
9000 | 1263 | 78.5
Figure 10: Running time and number of conjunctive results
with respect to the number of occurrences, for different values of
max


intervals) within the 20-minute time limit. CTD-Miner returned a
unique dependency describing the entire trajectory, with all
available streams, for all input sizes thanks to its merging procedure,
contrary to Pruning, which is limited by the temporal constraint
$\Delta = [0, 9]$ and returns an increasing number of pairwise
dependencies. Notice that for this particular $\Delta$, the performance of Pruning
is equivalent to that of Baseline, as the sub-dependency pruning is not
performed. CTD-Miner-TEDDY reaches the 20-minute limit for 30 streams
and returns the trajectory description in addition to a large number
of dependencies that are considered false positives w.r.t. the
simulation scenario.
(a) Running time (b) Number of conjunctive relations
Figure 11: Running time and number of conjunctive
relations w.r.t. the number of streams. $\Delta = [0, 9]$


7.2 Real-life motion data


Fig. 12 and Fig. 13 describe a sensor system composed of
4 outdoor cameras located in an office area. These cameras
capture motion using real-time video processing. Starting from
the images taken by these cameras, we defined 10 "virtual" motion
sensors (displayed as red polygons) corresponding to physical
regions and labelled 1-1, 1-2, ..., 4-3. Each virtual motion sensor
produces a sequence of time point events of two types: "B: Motion
Begin" and "E: Motion End". We defined for each area X an


Figure 12: Four outdoor camera views. Red contours de- 
scribe motion analysis areas. 


Figure 13: Position of motion analysis areas (aerial view) 
Table 2: Dataset corresponding to the experiment described in Fig. 13


Name | #Int | Len
1-1 | 1403 | 4303
1-2 | 851 | 3257
2-1 | 8640 | 47881
2-2 | 2099 | 4947
2-3 | 8548 | 26847
3-1 | 3909 | 14644
3-2 | 3699 | 13423
4-1 | 9686 | 30257
4-2 | 4578 | 13397
4-3 | 9825 | 21273


environment state Motion in area X, noted M-X and defined as
M-X(t, X) ::= last(t, X).v == B. The 10 resulting state streams are
described in Table 2. This data set describes motion activity in the
office area during 18 working days between 6 am and 8 pm. First
observations showed that it contains a significant amount of noise
(e.g., detection of shadows, sudden luminosity changes) and several
omissions (e.g., when a car passes through an
analysis zone at high speed). Moreover, the resulting streams
are extremely sparse, and low statistical thresholds impact the
correspondence relationship. As a consequence, we used $\epsilon = 0.2$
as relaxation parameter. The low statistical thresholds also impact, to
some extent, the precision of ITLD: thresholds were in some cases
lower than significant confidence variations. In our experiments
we used the statistical thresholds¹ with a significance level of 0.005
(critical value of $\chi^2_{0.005} = 7.88$ instead of $\chi^2_{0.05} = 3.84$) to obtain
more precise results.
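
The two critical values quoted above can be reproduced with the Python standard library. The sketch assumes one degree of freedom, which is what the values 3.84 and 7.88 correspond to; in that case a chi-square variable is the square of a standard normal variable, so the critical value is the square of the two-sided normal quantile.

```python
from statistics import NormalDist


def chi2_critical_df1(alpha: float) -> float:
    """Upper critical value of the chi-square distribution with 1 degree of freedom."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided standard normal quantile
    return z * z


if __name__ == "__main__":
    for alpha in (0.05, 0.005):
        print(alpha, round(chi2_critical_df1(alpha), 2))
```

Moving from a significance level of 0.05 to 0.005 roughly doubles the critical value, which makes the threshold harder to exceed for noisy confidence variations.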

The execution of CTD-Miner with $\Delta = [0, 5]$ was completed in
300 seconds and provided 12 disjunctive relations. Figure 14
provides two examples that illustrate two behaviours: entering the office
area (a) and leaving the office area (b). Note that the first
dependency was built using the incremental construction of CTDs
(all time lags are included in $\Delta$) and the second using dependency
merging. Moreover, none of the given dependencies provides false
information (e.g., a relation between 2-1 and 3-1). We emphasize
that the discovery process was completed in a negligible amount
of time in comparison with the observation duration.


8 CONCLUSION AND FUTURE WORK 


We proposed an approach to discover complex temporal depen- 
dencies (CTD) starting from interval based state streams. These 
streams are built through temporal abstractions on heterogeneous 
data. Each temporal abstraction (state) is defined using a predicate 
defining a state of interest for the application domain. We intro- 
duced the Complex Temporal Dependencies model that models 


Figure 14: Examples given by CTD-Miner on real-world data
generated by the environment depicted in Figure 13: (a) entering
office area; (b) leaving office area
temporal relations between state streams and their typical time
delays using conjunctive and disjunctive relations. We also
proposed CTD-Miner, an incremental algorithm devised for CTD
discovery. To validate our model and the discovery process, we
conducted several experiments on both simulated and real motion
sensor data. From these results, we conclude that CTD-Miner speeds
up the exploration process thanks to its linear time delay discovery,
its sub-dependency pruning and its dependency merging. The
encouraging results obtained on the real data set show that it is possible to
integrate video analysis methods in a data analysis process, which
opens perspectives for a wide range of application scenarios. For
example, in a commercial context, our approach may make it possible to
investigate temporal dependencies between user-given streams such as
commercial results or client satisfaction rate, sensor data streams
such as motion or people counting, and video processing-based streams such as
object classification (adults, strollers).

In future work, we intend to design an on-line version of
CTD-Miner in order to obtain up-to-date models of temporal phenomena
occurring within a sensed environment. This would make it possible to
investigate the ability to detect behavior drifts (emergence of new
dependencies) in addition to the forecasting capabilities of our
intersection-based dependency model.
ABSTRACT


The advantages offered by the presence of a schema are
numerous. However, many XML documents in practice are not
accompanied by a (valid) schema, making schema inference
an attractive research problem. The fundamental task in
XML schema learning is inferring restricted subclasses of
regular expressions. Most previous work either lacks support
for interleaving or supports it only in a limited form.
In this paper, we first propose a new subclass, Single Occurrence
Regular Expressions with Interleaving (SOIRE), which
has unrestricted support for interleaving. Then, based on
single occurrence automata and maximum independent sets,
we propose an algorithm, iSOIRE, to infer SOIREs. Finally,
we conduct a series of experiments on real datasets
to evaluate the effectiveness of our work, comparing it with
both current learning algorithms from academia and industrial
tools used in practice. The results reveal the practicability of
SOIRE and the effectiveness of iSOIRE, showing the high
preciseness and conciseness of our work.
1 INTRODUCTION


XML schemas have always played a crucial role in XML 
management. The presence of a schema for XML documents 
has many advantages, such as for query processing and opti- 
mization, development of database applications, data inte- 
gration and exchange [15, 18, 34, 42]. However, many XML 
documents in practice are not accompanied by a (valid) 
schema [3, 6, 25, 36, 37, 41], making schema inference an 
attractive research problem [2, 5, 7, 13, 17, 22, 30, 32, 43]. 
Studying schema inference also has several practical moti- 
vations. Schema inference techniques may be extended to 
schema repairing techniques [25]. Besides, schema inference is 
also useful in situations where a schema is already available, 
such as in schema cleaning and dealing with noise [7]. 

The content models of XML schemas are defined by regular 
expressions, and previous research has shown that the essen- 
tial task in schema learning is inferring regular expressions 
from a set of given samples [2, 5, 7, 9, 13, 17, 22, 30, 32, 43]. 
In fact, in some cases these learned regular expressions can 
directly be used as parts of the schema, and in other cases 
the inference of regular expressions is the most important 
component of the schema inference. Therefore, research on 
schema learning has focused on inferring regular expressions 
from a set of given samples. 

We focus on learning regular expressions with interleaving
(shuffle), denoted by RE(&), since RE(&) are widely used in
various areas of computer science [4], including XML database
systems [14, 19, 34], complex event processing [33], system
verification [10, 21, 23], plan recognition [26] and natural
language processing [27, 39].
Inference of regular expressions from a set of given samples
belongs to the problem of language learning. Gold proposed
a classical language learning model (learning in the limit
or explanatory learning) and pointed out that the class of
regular expressions is not identifiable from positive
samples only [24]. This means that no matter how many
positive samples from the target language (i.e., the language
to be learned) are provided, no algorithm can infer every
target regular expression. Hence, researchers have turned to
studying subclasses of regular expressions [2, 5, 7, 9, 13, 17, 22,
30, 32, 38, 43].

Most existing subclasses of regular expressions for XML
are defined on standard regular expressions, e.g., [5-7, 16, 35],
which were analyzed together in [28, 31]. For single occurrence
regular expressions (SOREs), in which each symbol
occurs at most once, and their subclass chain regular expressions
(CHAREs), Bex et al. proposed two inference algorithms,
RWR and CRX [7, 8]. Freydenberger and Kötzing [17]
proposed more efficient algorithms, Soa2Sore and Soa2Chare, for
the above-mentioned SOREs and CHAREs. Bex et al. [5] also
studied learning algorithms, based on the Hidden Markov
Model, for the subclass of regular expressions (k-OREs) in
which each alphabet symbol occurs at most k times. Notice
that none of the above subclasses supports an important
feature in XML, i.e., interleaving.

There may be no order constraint among siblings in
data-centric applications [1]. In such cases interleaving is
necessary. Here we list the more recent efforts on RE(&) inference
(see [13, 30, 32, 40, 43]). The aim of these approaches is
to infer restricted subclasses of single occurrence RE(&),
in which each symbol occurs at most once, starting from a
positive set of words. Ciucanu and Staworko proposed two
subclasses, disjunctive multiplicity expressions (DME) and
disjunction-free multiplicity expressions (ME) [11, 13], which
support unordered concatenation, a weaker form of
interleaving. The concatenation operator is disallowed in both
formalisms and ME even uses no disjunction operator. For
example, r1 = (a|b* )&c is a DME and r2 = a&b*&c’ is an
ME. But r3 = (atb’)&c* and r4 = a*((b*|c)&d*) satisfy
neither formalism. The inference algorithm for DME, based on
maximum clique, was given in [13]. Li et al. provided
an algorithm to learn DMEs from both positive and negative
examples based on genetic algorithms and simplified
candidate regions (SCRs) [29]. When there is no order constraint
among siblings, the relative orders within siblings are still
important. Peng and Chen [40] proposed a subclass, SIRE,
using the grammar: $S ::= T \& S \mid T$, $T ::= \varepsilon \mid a \mid a^* \mid TT$. But it
does not support the union operator. For example, r2 and
r3 are SIREs but r1 and r4 are not. Besides, they presented
an approximate algorithm to infer SIREs [40]. Li et al. [30]
proposed a subclass, ICRE, using the grammar:


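$$E ::= F_1^{p_1} \cdots F_n^{p_n}, \quad (n \geq 0,\ p_i \in \{?, 1\}),$$
$$F_i ::= D_1 \& \cdots \& D_k, \quad (i \in [1, n],\ k \geq 1),$$
$$D_j ::= a_1^{mul_1} \mid \cdots \mid a_m^{mul_m}, \quad (j \in [1, k],\ m \geq 1),$$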


Figure 1: Relationships among ME, DME, SIRE, ICRE,
ICHARE, ESIRE, SOIRE and RE(&): (a) ME ⊂ DME ⊂ ICRE;
(b) ME ⊂ SIRE ⊂ ICHARE; (c) DME ∩ SIRE = ME;
(d) ICRE ⊂ ESIRE ⊂ SOIRE ⊂ RE(&), ICHARE ⊂ ESIRE ⊂
SOIRE ⊂ RE(&).


where $mul_o \in \{1, ?, *, +\}$ and $a_o \in \Sigma$ for $o \in [1, m]$. For
example, r1, r2 and r4 are ICREs but r3 is not. Besides,
they presented an approximate algorithm to infer ICREs [30].
Zhang et al. [43] proposed a subclass called ICHARE that
considers interleaving. The inference algorithm is based on SOA and
maximum independent set (MIS). However, components of
interleaving are restricted to the extended strings (ES) defined
in [43]. For example, r2 and r3 are ICHAREs but r1, r4 and
r5 = a' ((b* |c)d* &ef") are not. Li et al. [32] proposed a
practical subclass called ESIRE and designed an inference algorithm,
GenESIRE, to infer ESIREs. For example, r1, r2, r3, r4 and
r5 are ESIREs, but r6 = a*b'(fm'&c'd|e(n|l)' g&h^)(3*|k)
is not. All of the above subclasses are restricted subclasses
of single occurrence RE(&). As shown above, the support for
interleaving in existing work is very limited.

In this paper, based on the analysis of large-scale real data,
we propose a new subclass of RE(&), i.e., single occurrence
RE(&), called SOIRE. The relationships among ME, DME,
SIRE, ICRE, ICHARE, ESIRE, SOIRE and RE(&) are shown
in Figure 1. Among them, ME ⊂ DME ⊂ ICRE, ME ⊂ SIRE
⊂ ICHARE, DME ∩ SIRE = ME, ICRE ⊂ ESIRE ⊂ SOIRE
⊂ RE(&) and ICHARE ⊂ ESIRE ⊂ SOIRE ⊂ RE(&). For
example, all of r1, r2, r3, r4, r5 and r6 are SOIREs. This reveals
that SOIRE is more powerful than the above subclasses,
since the latter are all subclasses of SOIRE, and in particular that
SOIRE has unrestricted support for interleaving, which was
never achieved by existing work. Then, we develop the
corresponding learning algorithm, iSOIRE, to carry out SOIRE
inference automatically. The extensive experimental results
demonstrate the practicality of the proposed subclass as well
as the preciseness and conciseness of iSOIRE.
The main contributions of this paper are listed as follows.


• We propose a new subclass SOIRE of RE(&). SOIRE
is more powerful than the existing subclasses and, in
particular, has unrestricted support for interleaving.

• Correspondingly, we design an inference algorithm,
iSOIRE, which can learn SOIREs effectively based on
single occurrence automaton (SOA) and maximum
independent set (MIS).

• We conduct a series of experiments, comparing the
performance of our algorithm with both current learning
algorithms from academia and industrial tools used in
practice. The results reveal the practicability of SOIRE
and the effectiveness of iSOIRE, showing the high
preciseness and conciseness of our work.


The rest of this paper is organized as follows. Prelimi- 
naries are presented in Section 2. Section 3 provides the 
learning algorithm. Then a series of experiments is presented 
in Section 4. Finally we conclude this work in Section 5. 


2 PRELIMINARIES 
2.1 Definitions 


Let $\Sigma$ be a finite alphabet of symbols. The set of all words
over $\Sigma$ is denoted by $\Sigma^*$. The empty word is denoted by $\varepsilon$.


Definition 2.1. Regular Expression with Interleaving.
A regular expression with interleaving over $\Sigma$ is defined
inductively as follows: $\varepsilon$ or $a \in \Sigma$ is a regular expression; for
regular expressions $r_1$ and $r_2$, the disjunction $r_1 | r_2$, the
concatenation $r_1 \cdot r_2$, the interleaving $r_1 \& r_2$, and the Kleene star
$r_1^*$ are also regular expressions. $r?$ and $r^+$ are abbreviations
of $r | \varepsilon$ and $r \cdot r^*$, respectively. The class of these expressions is denoted RE(&).


The size of a regular expression r, denoted by |r|, is the total
number of symbols and operators occurring in r. The language L(r)
of a regular expression r is defined as follows: L(∅) = ∅;
L(ε) = {ε}; L(a) = {a}; L(r*) = L(r)*; L(r1·r2) = L(r1)L(r2);
L(r1|r2) = L(r1) ∪ L(r2); L(r1&r2) = L(r1)&L(r2). Let u = au′ and
v = bv′ where a, b ∈ Σ and u′, v′ ∈ Σ*; then u&ε = ε&u = {u} and
u&v = a(u′&v) ∪ b(u&v′). For example, L(ab&c) = {cab, acb, abc}.
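The recursive definition of the interleaving of two words can be sketched directly in Python. This is only an illustration under our own naming (the function interleave is not part of the formalism above), and it assumes that every symbol is a single character.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def interleave(u: str, v: str) -> frozenset:
    """Return u & v, the set of all interleavings of the words u and v."""
    if not u:
        return frozenset({v})  # ε & v = {v}
    if not v:
        return frozenset({u})  # u & ε = {u}
    # u & v = a(u' & v) ∪ b(u & v'), where u = au' and v = bv'
    left = {u[0] + w for w in interleave(u[1:], v)}
    right = {v[0] + w for w in interleave(u, v[1:])}
    return frozenset(left | right)


print(sorted(interleave("ab", "c")))  # ['abc', 'acb', 'cab']
```

The printed set is exactly L(ab&c) from the example above.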


Definition 2.2. Single Occurrence Regular Expressions with
Interleaving (SOIRE). A regular expression with interleaving is
an SOIRE if each symbol occurs in it at most once.


For instance, r1 = a'(b'c&d'(e|f)') is an SOIRE, but
r2 = a*b&c*b is not because b appears twice.
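The single occurrence condition is easy to check mechanically. The sketch below is an illustration only: the helper name is_single_occurrence and the textual encoding of expressions (single-character symbols, operators written as |, &, *, +, ? and parentheses) are our assumptions.

```python
from collections import Counter

# Characters treated as operators or layout rather than alphabet symbols.
OPERATORS = set("|&*+?().· ")


def is_single_occurrence(expr: str) -> bool:
    """True if every alphabet symbol occurs at most once in expr."""
    counts = Counter(ch for ch in expr if ch not in OPERATORS)
    return all(n == 1 for n in counts.values())


print(is_single_occurrence("a*b&c*b"))  # False: b appears twice
```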


Definition 2.3. Single Occurrence Automaton (SOA) [7, 17]. Let Σ
be a finite alphabet, and let src and snk be distinct symbols
that do not occur in Σ. A single occurrence automaton (SOA for
short) over Σ is a finite directed graph A = (V, E) such that


(1) src, snk ∈ V, and V ⊆ Σ ∪ {src, snk};
(2) src has only outgoing edges, snk has only incoming edges, and
every node v ∈ V lies on a path from src to snk.
Figure 2: Example SOA A for r = a(bc)'d*. 


For example, the SOA A for r = a(bc)'d* is shown in
Figure 2. A generalized single occurrence automaton
(generalized SOA) over Σ is defined as a directed graph
in which each node v ∈ V \ {src, snk} is an SOIRE and all
nodes are pairwise alphabet-disjoint SOIREs.


3 LEARNING ALGORITHM 


In this section, we present the learning algorithm iSOIRE, which
efficiently infers an SOIRE from a set of positive samples S,
together with its major technical details. The input of iSOIRE is
a set of samples and its output is an SOIRE. The algorithm
consists of two steps: constructing an SOA from the samples
(Section 3.1) and converting the SOA into an SOIRE (Section 3.2).


Algorithm 1: iSOIRE

Input: a set of positive samples S
Output: an SOIRE

1 Construct SOA A for S using method 2T-INF [20];
2 return Soa2Soire(S, A)


3.1 Constructing an SOA from Samples 


We use the method 2T-INF [20] to construct the SOA A for S.
2T-INF has been proved to construct a minimal-inclusion
generalization of S. Here, minimal-inclusion means that there is
no other SOA A′ such that S ⊆ L(A′) ⊂ L(SOA(S)).

Here we give an example to show the execution process. Let
S = {begk, aabengk, abegjj, beghk, behgj, belhg, bheg, bfcmd,
bfdm, afmcd, adf}. Using the method 2T-INF, we construct the
graph SOA(S) shown in Figure 3.
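The construction can be sketched as follows, under the usual reading of 2T-INF as collecting the adjacent symbol pairs (2-grams) of every sample, with src and snk added as start and end markers. The function name two_t_inf and the edge-set representation are our assumptions, and symbols are taken to be single characters.

```python
def two_t_inf(samples):
    """Build the vertex and edge sets of an SOA from the 2-grams of the samples."""
    edges = set()
    for word in samples:
        path = ["src", *word, "snk"]
        # Every pair of adjacent positions on the path becomes an edge.
        edges.update(zip(path, path[1:]))
    vertices = {v for edge in edges for v in edge}
    return vertices, edges


S = ["begk", "aabengk", "abegjj", "beghk", "behgj", "belhg",
     "bheg", "bfcmd", "bfdm", "afmcd", "adf"]
V, E = two_t_inf(S)
print(len(V), ("a", "a") in E, ("j", "j") in E)  # 15 True True
```

For the sample set S above, the resulting graph has the 13 alphabet symbols plus src and snk as vertices, and self-loops on a and j.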


3.2 Converting the SOA into an SOIRE 


We use dot-notation to denote the application of subroutines.
For a given SOA A, we let A.src and A.snk denote the source
and the sink of A, respectively. We let V be the set of vertices
and E the set of edges in A.

Figure 3: Constructing SOA A for S.


• For any vertex v ∈ V, we let A.pred(v) denote the set of all
predecessors of v in A; similarly, A.succ(v) denotes the set of
all successors of v in A.

• For any vertex v ∈ V, we let A.reach(v) be the set of all
vertices reachable from v.

• “first” returns all vertices v such that the only predecessor
of v is the source in A.

• “contract” on SOA A takes a subset U of vertices of A and a
label δ. The procedure modifies A such that all vertices of U
are contracted to a single vertex labeled δ (edges are moved
accordingly).

• “extract” on SOA A takes as argument a set of vertices U of A;
it does not modify A, but returns a new SOA with copies of all
vertices of U as well as two new vertices for source and sink;
all edges between vertices of U are copied, all vertices in U
having an incoming edge in A from outside of U now have an
incoming edge from the new source, and all vertices in U having
an outgoing edge in A to outside of U now have an outgoing edge
to the new sink.

• “addEpsilon” on SOA A adds a new vertex labeled ε; all outgoing
edges from the source to vertices that have more than one
predecessor (vertices that are not in the first-set) are
redirected via this new vertex.

• “exclusive” on SOA A on argument v (a vertex of A) returns the
set of all vertices u such that, on any path from the source to
the sink that visits u, v is necessarily visited previously.
Intuitively, the exclusive set of a vertex v is the set of all
vertices exclusively reachable from v, not from any other vertex
incomparable to v.
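Several of these operations can be illustrated on a plain edge-set representation. The class below is a minimal sketch and not the authors' implementation: the class name, the string vertices "src" and "snk", and the way exclusive is computed (v together with every vertex that becomes unreachable from the source once v is removed) are our assumptions.

```python
class SOA:
    """A tiny SOA over an edge set; 'src' and 'snk' are the distinguished vertices."""

    def __init__(self, edges):
        self.edges = set(edges)
        self.vertices = {v for edge in self.edges for v in edge}

    def succ(self, v):
        return {y for (x, y) in self.edges if x == v}

    def pred(self, v):
        return {x for (x, y) in self.edges if y == v}

    def reach(self, v, removed=frozenset()):
        """Vertices reachable from v by at least one edge, avoiding `removed`."""
        seen, stack = set(), [v]
        while stack:
            for y in self.succ(stack.pop()):
                if y not in seen and y not in removed:
                    seen.add(y)
                    stack.append(y)
        return seen

    def first(self):
        """Vertices whose only predecessor is the source."""
        return {v for v in self.vertices if self.pred(v) == {"src"}}

    def exclusive(self, v):
        """v plus all vertices that cannot be reached from src without passing v."""
        still_reachable = self.reach("src", removed={v})
        return self.vertices - still_reachable - {"src", "snk"}


A = SOA({("src", "a"), ("src", "b"), ("a", "c"),
         ("b", "d"), ("c", "d"), ("d", "snk")})
print(sorted(A.first()), sorted(A.exclusive("a")))  # ['a', 'b'] ['a', 'c']
```

In this small graph, c can only be reached through a, whereas d is also reachable through b, so the exclusive set of a is {a, c}.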


Y. Li et al. 


Furthermore, we use the following eight subroutines or
algorithms.


• “plus” on label δ returns δ*.

• “or” on labels δ and δ′ returns δ|δ′.

• “concatenate” on labels δ and δ′ returns δ·δ′.

• “filter” on a subset U of vertices and a set of given samples S
returns a new sample set S′. For each string s ∈ S, each symbol
s_i of s is mapped as follows: π_s(U, s_i) = s_i if s_i ∈ U;
π_s(U, s_i) = ε otherwise. The result is reduced by xε = εx = x.
For example, let U = {b, c, r} and S = {abgr, ebbdfc}; then
S′ = filter(U, S) = {br, bbc}.

• “Merge” on a set of positive samples S returns an expression ζ
with interleaving. For a set of positive samples S, we let por(S)
denote the set of all partial order relations of each string in S
and cs(S) denote the constraint set, which is defined as follows:
cs(S) = {(x, y) | (x, y) ∈ por(S) and (y, x) ∈ por(S)}.

• “combine” on a subset U of vertices returns a new vertex, which
combines all vertices in U with the interleaving operator. For
example, let U = {a*, b*}; then combine(U) is a*&b*.

• “clique_removal” on an undirected graph G returns a maximum
independent set (MIS). Finding an MIS of a graph G is an NP-hard
problem. Hence we use the method clique_removal() [12] to find an
approximate result.
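The subroutines filter, por and cs can be illustrated with a few lines of Python. This is a sketch under our own assumptions: symbols are single characters, filter_samples returns a list instead of a set, and a pair (x, x) arising from a repeated symbol is not treated as a constraint.

```python
from itertools import combinations


def filter_samples(U, S):
    """Project every sample onto the symbols in U (all other symbols become ε)."""
    return ["".join(ch for ch in s if ch in U) for s in S]


def por(S):
    """All ordered pairs (x, y) such that x occurs before y in some sample."""
    return {(s[i], s[j]) for s in S for i, j in combinations(range(len(s)), 2)}


def cs(S):
    """Constraint set: pairs of distinct symbols observed in both orders."""
    pairs = por(S)
    return {(x, y) for (x, y) in pairs if x != y and (y, x) in pairs}


print(filter_samples({"b", "c", "r"}, ["abgr", "ebbdfc"]))  # ['br', 'bbc']
print(sorted(cs(["fmcd", "fcmd", "df", "fdm"])))
```

The first call reproduces the filter example above; the second computes the constraint set of the four strings fmcd, fcmd, df and fdm, which contains the pairs over {f, d}, {m, d} and {m, c} in both orders.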


The algorithm Soa2Soire is given in Algorithm 2. The main
procedures are as follows.


(1) We first deal with all strongly connected looped components
and replace each with a new vertex.

(2) Once the SOA is a directed acyclic graph (DAG), focus on the
set F of all vertices which can be reached from the source
directly, but not via other vertices; make sure that there are
no vertices which can be reached both directly and via other
vertices (if necessary, add an auxiliary node labeled ε).

(3) Recurse on the sets of vertices exclusively reachable from a
vertex in F and contract these sets to vertices labeled with the
result of the recursion.

(4) Combine vertices in F with “or”, and recurse again on what is
exclusively reachable from this new vertex.

(5) Once only one item is left in F, split it off and recurse on
the remainder.
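Step (1) needs the strongly connected components that contain a cycle. The following sketch finds them by plain reachability, which is sufficient for small alphabets; it is not meant as an efficient SCC algorithm, and the helper names are our own. The edges are rebuilt from the sample set of Section 3.1 so that the example is self-contained.

```python
def build_edges(samples):
    """Edges of the SOA obtained from the adjacent symbol pairs of the samples."""
    edges = set()
    for word in samples:
        path = ["src", *word, "snk"]
        edges.update(zip(path, path[1:]))
    return edges


def reach(edges, v):
    """Vertices reachable from v by at least one edge."""
    seen, stack = set(), [v]
    while stack:
        x = stack.pop()
        for (p, q) in edges:
            if p == x and q not in seen:
                seen.add(q)
                stack.append(q)
    return seen


def looped_components(edges):
    """Strongly connected components in which every vertex lies on a cycle."""
    vertices = {v for edge in edges for v in edge}
    reachable = {v: reach(edges, v) for v in vertices}
    components = set()
    for v in vertices:
        if v in reachable[v]:  # v lies on a cycle (possibly a self-loop)
            components.add(frozenset(u for u in reachable[v] if v in reachable[u]))
    return components


S = ["begk", "aabengk", "abegjj", "beghk", "behgj", "belhg",
     "bheg", "bfcmd", "bfdm", "afmcd", "adf"]
for component in sorted(looped_components(build_edges(S)), key=sorted):
    print(sorted(component))
```

On this sample set the sketch reports the four components {a}, {j}, {c, d, f, m} and {e, g, h, l, n}.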


Note that the algorithm introduces “?” by way of constructing
“or ε”. This can be cleaned up by postprocessing the resulting
SOIRE.

Algorithm 2: Soa2Soire
Input: a set of positive samples S; an SOA A = (V, E)
Output: an SOIRE

 1 if |E| = 0 then return ∅;
 2 else if |V| = 2 then return ε;
 3 else if A has a cycle then
 4     Let U be a strongly connected component of A;
 5     if |U| = 1 then
 6         Let v be the only vertex of U;
 7         A.contract(U, plus(v.label()));
 8     else A.contract(U, Merge(filter(U, S)));
 9 else if A.succ(A.src) ⊈ A.first() then
10     A.addEpsilon();
11 else if |A.first()| = 1 then
12     Let v be the only successor of src;
13     δ ← v.label();
14     A.contract({A.src, v}, src);
15     δ′ ← Soa2Soire(S, A);
16     return concatenate(δ, δ′);
17 else if ∃v ∈ A.first(), A.exclusive(v) ≠ {v} then
18     Let v be such that A.exclusive(v) ≠ {v};
19     U ← A.exclusive(v);
20     A.contract(U, Soa2Soire(S, A.extract(U)));
21 else
22     Let u, v ∈ A.first() with u ≠ v s.t. A.reach(u) ∩ A.reach(v) is ⊆-maximal;
23     A.contract({u, v}, or(u.label(), v.label()));
24 return Soa2Soire(S, A);


The algorithm Merge is given in Algorithm 3. The main
procedures are as follows.


(1) The first step (line 1): We first compute the constraint set
constraint_tr using the function cs(S).

(2) The second step (line 3): We construct an undirected graph G
using the elements of constraint_tr as edges.

(3) The third step (lines 5-8): We select a maximum independent
set (MIS) of G, add it to the list all_mis and delete the MIS and
its related edges from G. The process is repeated until there are
no nodes left in G.

(4) The fourth step (lines 9-13): For each MIS, we get the sample
set S′ using the function filter(mis, S) and construct an SOA for
S′ by calling the algorithm 2T-INF [20]. We then convert the SOAs
into SOIREs using the algorithm Soa2Soire.

(5) The last step (line 14): We call the function combine to
generate an expression ζ with the interleaving operator.
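The partitioning performed in the first three steps of Merge can be sketched as follows. Note that this is not the clique_removal heuristic of [12]: for illustration we swap in an exact brute-force search for a maximum independent set, which is exponential in general and only reasonable for the small alphabets of a strongly connected component. The function names and the lexicographic tie-breaking are our assumptions.

```python
from itertools import combinations


def max_independent_set(nodes, edges):
    """Exact maximum independent set by brute force (small graphs only)."""
    nodes = sorted(nodes)
    for size in range(len(nodes), 0, -1):
        for candidate in combinations(nodes, size):
            chosen = set(candidate)
            if not any(x in chosen and y in chosen for x, y in edges):
                return chosen
    return set()


def partition_by_mis(nodes, constraints):
    """Repeatedly remove a maximum independent set until no node is left."""
    remaining, all_mis = set(nodes), []
    while remaining:
        edges = [(x, y) for x, y in constraints
                 if x != y and x in remaining and y in remaining]
        mis = max_independent_set(remaining, edges)
        all_mis.append(mis)
        remaining -= mis
    return all_mis


constraints = {("f", "d"), ("d", "f"), ("m", "d"),
               ("d", "m"), ("m", "c"), ("c", "m")}
print(partition_by_mis("fdmc", constraints))
```

For the constraint set over {f, d, m, c} used in the running example, the sketch yields the two groups {c, d} and {f, m}; with a different tie-breaking rule or an approximate heuristic, other valid partitions are possible.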


Following the example in Section 3.1, there are four strongly
connected components U1 = {a}, U2 = {j}, U3 = {f, d, m, c}
and U4 = {l, g, h, e, n}, shown in Figure 4. For the strongly
connected component (SCC) U1 = {a}, because |U1| = 1, we use
A.contract(U1, plus(a)) to modify A such that vertex a is
contracted to a new vertex a* and the self-loop is removed.
Similarly, we use A.contract(U2, plus(j)) to modify A such that
vertex j is contracted to a new vertex j* and the self-loop is
removed (Figure 5). For SCC U3, because |U3| > 1, we call
A.contract(U3, Merge(filter(U3, S))). In this sub-process, we
first compute the new sample set S1 = {fmcd, fcmd, df, fdm}
using the function filter(U3, S). Then we get
cs(S1) = {(f, d), (d, f), (m, d), (d, m), (m, c), (c, m)} in
the algorithm Merge. Next, we construct the undirected graph


Algorithm 3: Merge
Input: a set of positive samples S
Output: an expression ζ

 1 constraint_tr ← cs(S);
 2 U ← ∅;
 3 G ← Graph(constraint_tr);
 4 all_mis ← ∅;
 5 while |G.nodes()| > 0 do
 6     W ← clique_removal(G) [12];
 7     G ← G \ W;
 8     all_mis.append(W)
 9 foreach mis ∈ all_mis do
10     S′ ← filter(mis, S)
11     Construct SOA A for S′ using method 2T-INF [20];
12     δ ← Soa2Soire(S′, A)
13     U.append(δ)
14 return ζ ← combine(U)


Figure 4: Four SCCs of SOA. 


G1 based on cs(S1), shown in Figure 6. We compute the list of
maximum independent sets (all_mis = {{f, m}, {c, d}}) for
Figure 6. We construct two SOAs using filter({f, m}, S1) and
filter({c, d}, S1), respectively; they are shown in Figure 7 and
Figure 8. We convert the two SOAs into fm? and c?d, respectively.
Then we get the new label ζ = fm?&c?d using combine(fm?, c?d).
We use A.contract(U3, ζ) to modify A such that all vertices of U3
are contracted to a single vertex labeled ζ (edges are moved
accordingly), as shown in Figure 9. Similarly, we call
A.contract(U4, Merge(filter(U4, S))). We first compute the new
sample set S2 = {egh, eng, eg, elhg, ehg, heg} using
filter(U4, S). Then we get cs(S2) = {(g, h), (h, g), (h, e),
(e, h)} in the algorithm Merge. Next, we construct the undirected
graph G2 based on cs(S2), shown in Figure 10. We compute the list
of maximum independent sets {{l, g, e, n}, {h}} for Figure 10. We
construct two SOAs using filter({l, g, e, n}, S2) and
filter({h}, S2), respectively; they are shown in Figure 11 and
Figure 12. We convert the two SOAs into e(n|l)?g and h?,
respectively. Then we get the new label δ = e(n|l)?g&h? using
combine(e(n|l)?g, h?). We use A.contract(U4, δ) to modify A such
that all vertices of U4 are contracted to a single vertex labeled
δ (edges are moved accordingly), as shown in Figure 13.
Continuing with the remaining steps of the algorithm iSOIRE, we
get the final inferred result
r = a*b?(fm?&c?d | e(n|l)?g&h?)(j*|k)?.


Figure 5: Dealing with SCCs U1 and U2 of SOA A.

Figure 6: Constructing undirected graph G1.

Figure 7: Constructing SOA A1 of filter({f, m}, S1).

Figure 8: Constructing SOA A2 of filter({c, d}, S1).

Figure 9: Dealing with SCC U3 of SOA A.

Figure 10: Constructing undirected graph G2.

Figure 11: Constructing SOA A3.

Figure 12: Constructing SOA A4.

Figure 13: Dealing with SCC U4 of SOA A.


4 EXPERIMENTS


In this section, we conduct a series of experiments to analyze
the practicability of SOIRE, and compare the algorithm iSOIRE
not only with learning algorithms from ongoing research but also
with industrial-level tools used in practice. In terms of
preciseness and conciseness, our work achieves satisfying results
compared with existing methods, reaching higher preciseness with
a shorter description length. Specifically, the indicators
Language Size (|L(r)|) [5] and datacost (DC) [5] are used to
measure preciseness, while Len [30] and Nesting Depth (ND) [31]
measure conciseness. As with |L(r)| and Len, the smaller the
value of DC (ND), the more precise (concise) the regular
expression. Language Size [5], denoted by |L(r)|, is defined as:


$$|L(r)| = \sum_{\ell=1}^{\ell_{\max}} |L^{\ell}(r)|$$


where |L^ℓ(r)| is the number of words of length ℓ in L(r). In
general, L(r) is an infinite language containing words of
unbounded length ℓ, so it is impossible to take all words into
account. Hence, we only consider word lengths ℓ up to a maximum
value ℓ_max = 2m + 1, where m is the length of r excluding ε, ∅
and regular expression operators. Language Size (|L(r)|) measures
the preciseness of a regular expression well: the smaller the
value of |L(r)|, the more precise the regular expression.
datacost (DC) [5] is defined as:


$$datacost(r, S) = \sum_{\ell=1}^{\ell_{\max}} \left( 2 \times \log_2 \ell + \log_2 \binom{|L^{\ell}(r)|}{|S^{\ell}|} \right)$$


where ℓ_max = 2m + 1 and |L^ℓ(r)| are as before, and |S^ℓ| is the
number of words in S that have length ℓ. The smaller the value of
DC, the more precise the regular expression. Len [30] is defined
as:


$$Len = n \times \lceil \log_2(|\Sigma| + |M|) \rceil$$


where |Σ| is the number of distinct symbols occurring in the
regular expression r, M is the set of metacharacters
{|, ·, &, ?, *, +, (, )} and n is the length of r including
symbols and metacharacters. An expression with a smaller value of
Len is more concise. Nesting Depth (ND) [31] is defined as:


• ND(r) = 0, if r = ε, or r = a for a ∈ Σ.

• ND(r) = ND(r1) + 1, if r = r1*, r = r1+ or r = r1?, where r1 is
a regular expression over Σ.

• ND(r) = max{ND(r1), ND(r2)}, if r = r1|r2, r = r1·r2 or
r = r1&r2, where r1 and r2 are regular expressions over Σ.
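The four indicators can be made concrete with a small Python sketch. It is an illustration under several assumptions of ours: expressions are encoded as nested tuples with the operator names "cat", "or", "and", "star", "plus" and "opt"; symbols are single characters; the language is enumerated by brute force up to ℓ_max; |S^ℓ| counts distinct sample words; S is assumed to be contained in L(r); and Len is computed from a string in which concatenation is written explicitly as ".".

```python
import math


def shuffle(u, v):
    """All interleavings of the words u and v."""
    if not u or not v:
        return {u + v}
    return ({u[0] + w for w in shuffle(u[1:], v)}
            | {v[0] + w for w in shuffle(u, v[1:])})


def cat(*parts):
    """Left-nested concatenation of several sub-expressions."""
    r = parts[0]
    for p in parts[1:]:
        r = ("cat", r, p)
    return r


def symbols(r):
    """Symbol occurrences of r (a symbol is a one-character string)."""
    if isinstance(r, str):
        return [r] if r else []
    return [a for sub in r[1:] for a in symbols(sub)]


def lang(r, lmax):
    """Words of L(r) whose length is at most lmax."""
    if isinstance(r, str):
        return {r} if len(r) <= lmax else set()
    op, args = r[0], [lang(sub, lmax) for sub in r[1:]]
    if op == "opt":
        return args[0] | {""}
    if op in ("star", "plus"):
        closure, frontier = {""}, {""}
        while frontier:  # grow L(r1)* word by word up to the length bound
            frontier = {x + y for x in frontier for y in args[0]
                        if len(x + y) <= lmax} - closure
            closure |= frontier
        if op == "star":
            return closure
        return {x + y for x in args[0] for y in closure if len(x + y) <= lmax}
    if op == "or":
        return args[0] | args[1]
    combine = shuffle if op == "and" else (lambda x, y: {x + y})
    return {w for x in args[0] for y in args[1]
            if len(x) + len(y) <= lmax for w in combine(x, y)}


def lmax_of(r):
    return 2 * len(symbols(r)) + 1


def language_size(r):
    return sum(1 for w in lang(r, lmax_of(r)) if w)


def datacost(r, samples):
    words, distinct = lang(r, lmax_of(r)), set(samples)
    total = 0.0
    for l in range(1, lmax_of(r) + 1):
        n_lang = sum(1 for w in words if len(w) == l)
        n_samp = sum(1 for w in distinct if len(w) == l)
        total += 2 * math.log2(l) + math.log2(math.comb(n_lang, n_samp))
    return total


def nesting_depth(r):
    if isinstance(r, str):
        return 0
    inner = max(nesting_depth(sub) for sub in r[1:])
    return inner + 1 if r[0] in ("star", "plus", "opt") else inner


def len_metric(expr):
    """Len = n * ceil(log2(|Σ| + |M|)) with |M| = 8 metacharacters."""
    tokens = [ch for ch in expr if not ch.isspace()]
    alphabet = {ch for ch in tokens if ch.isalnum()}
    return len(tokens) * math.ceil(math.log2(len(alphabet) + 8))


r = cat("a", "c", "f", "u", ("and", ("opt", "l"), ("opt", "m")))
print(language_size(r), nesting_depth(r), len_metric("a.c.f.u.(l?&m?)"))
```

For the expression acfu(l?&m?), the sketch enumerates five words and gives a nesting depth of 1 and a Len of 60. The datacost function additionally needs a sample set; any set used with it here would be hypothetical, since the original samples are not listed in the text.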


The learning algorithms compared in the experiments are
Soa2Sore [17] and Soa2Chare [17], GenEchare [16],
learner+DME [13], conMiner [40], GenICHARE [43] and
GenESIRE [32]. The industrial tools capable of inferring XML
schemas used in this section include IntelliJ IDEA¹,
Liquid Studio², Trang³ and InstanceToSchema⁴.

Figure 14: The proportion of subclasses on Relax NG. The dataset
used for this statistical experiment is acquired from [28], with
509,267 regular expressions from 4,526 Relax NG schemas.

For the extensive comparative experiments, we use two kinds of
datasets of XML documents: a small dataset (mastersthesis) and a
large dataset (www), both extracted from DBLP. DBLP is a
data-centric database of information on major computer science
journals and proceedings. We downloaded the file
dblp-2015-03-02.xml.gz⁵. mastersthesis and www are two elements
chosen from DBLP, with 5 (small) and 2,000,226 (large) samples,
respectively.

All of our experiments were conducted on a machine with a
16-core Intel Xeon CPU E5620 @ 2.40GHz with 12M cache and
24G RAM, running Windows 10.


4.1 Usage of SOIRE in Practice 


Although interleaving is indispensable in data-centric
applications, it has received little research attention. In
Figure 14, we visualize the proportion of regular expressions
covered by different subclasses on Relax NG. The initial
subclass, DME, covers only 50.62%. The proportions then show an
upward trend, reaching more than 85.55% (ICRE, ICHARE, ESIRE).
In comparison, SOIRE covers 93.24%, which is 5.68 percentage
points more than the second largest proportion. The experimental
result therefore shows the high practicality of SOIRE and its
strong support for interleaving.


4.2 Analysis of Inference Results 


To better illustrate the performance of our work, we first
compare our inferred results with those of existing learning
algorithms and industrial tools. To save space, we use
abbreviations for the element names; the list of abbreviations
is shown in Table 1. The experimental results are shown in
Tables 2-5.


¹ https://www.jetbrains.com/idea/
² https://www.liquid-technologies.com/
³ http://www.thaiopensource.com/relaxng/trang.html
⁴ http://www.xmloperator.net/i2s/
⁵ http://dblp.org/xml/release/dblp-2015-03-02.xml.gz

Table 1: The list of abbreviations for words in DBLP.


Word       Abbr.   Word      Abbr.   Word      Abbr.
author     a       editor    b       title     c
booktitle  d       pages     e       year      f
address    g       journal   h       volume    i
number     j       month     k       url       l
ee         m       cdrom     n       cite      o
publisher  p       note      q       crossref  r
isbn       s       series    t       school    u
chapter    v       publnr    w


We can see from Table 3 that, for the dataset mastersthesis, the
first six algorithms/tools (Liquid Studio, Soa2Sore, Soa2Chare,
GenEchare, IntelliJ IDEA and Trang) reach high conciseness at an
enormous cost in |L(r)|, ranging from an unaffordable
1.57 × 10^19 down to 1.64 × 10^4. InstanceToSchema, learner+DME
and conMiner have the highest conciseness, with a Len of 52, yet
their preciseness is not the highest among these algorithms.
Finally, the last three algorithms, including iSOIRE, perform at
the same level, with the highest preciseness and conciseness of
the same magnitude. From the table we can conclude that, although
interleaving can improve preciseness, it sacrifices conciseness
to some degree.


Table 2: Expressions of inference using different learning
algorithms/inference tools on mastersthesis.


Method            Regular Expression
Liquid Studio     (a|c|f|u|l|m)'*
Soa2Sore          acfu(l|m)*
Soa2Chare         acfu(l|m)*
GenEchare         acfu(l|m)*
IntelliJ IDEA     acfu(l|m)*
Trang             acfu(l|m)*
InstanceToSchema  a&c&f&l?&m?&u
learner+DME       a&c&f&l?&m?&u
conMiner          acful?&m?
GenICHARE         acfu(l?&m)
GenESIRE          acfu(l?&m)
iSOIRE            acfu(l?&m?)


Table 3: Results of inference using different learning
algorithms/inference tools on mastersthesis.


Method            |L(r)|         DC       Len  ND
Liquid Studio     1.57 × 10^19   122.880  56   1
Soa2Sore          1.64 × 10^4    67.657   56   1
Soa2Chare         1.64 × 10^4    67.657   56   1
GenEchare         1.64 × 10^4    67.657   56   1
IntelliJ IDEA     1.64 × 10^4    67.657   56   1
Trang             1.64 × 10^4    67.657   56   1
InstanceToSchema  984            102.446  52   1
learner+DME       984            102.446  52   1
conMiner          13             72.886   52   1
GenICHARE         5              65.072   60   1
GenESIRE          5              65.072   60   1
iSOIRE            5              65.072   60   1


For the second dataset (Table 5), the advantage of our work is
more pronounced. Without support for interleaving, the previous
eleven algorithms/tools have huge |L(r)| and DC values, from
1.11 × 10^21 to 4.39 × 10^11 and from 15158.773 to 8479.873,
respectively. Among them, Liquid Studio, Soa2Chare and IntelliJ
IDEA have the shortest Len, 120, while learner+DME and GenESIRE
have the longest, 175. Soa2Sore has the deepest ND [31], 3,
followed by Liquid Studio, GenEchare, GenICHARE and GenESIRE,
with 2 nestings. On the other hand, the algorithms/tools which
support interleaving have smaller values on average. For the
indicator |L(r)| in particular, the magnitudes are much smaller
than those of the first group of methods. It is noteworthy that
our work reaches almost the same conciseness with much smaller
values of |L(r)| (1.84 × 10^11) and DC (7599.996).


Table 4: Expressions of inference using different learning
algorithms/inference tools on www.


Method            Regular Expression
Liquid Studio     (alcl1t+ |qt Jo|b|f|m|d|r)+
Soa2Sore          b* (a* (c(m* |d))* (I|q|f]o)*)+ |r
Soa2Chare         b*r? (m o|flall|a]c[d)*
GenEchare         (bt |r)? (m|o* |f|a* |1* |q* |c|d) *
IntelliJ IDEA     1? b* (a|d|o|mlqlell|£) *
Trang             b* (r|(ald[o[mlalc[1]f)* )
InstanceToSchema  m &q* &b* &f^ &a* &o* &c! &d! Ber” Belt
learner+DME       (q* j£ [r^ )&(o* ja? [m?)&(a* |b*)&c* &1*
conMiner          r!b*c*o*d' m" f? &a*q* &1*
GenICHARE         (b+ |r)? (a*q*d?m? &c t o* f? &1*)?
GenESIRE          (b* |r)? (a* (m? |q* |d) &c(o* |f) &1*)?
iSOIRE            b* ((a* (q* |a?) |m)&c(o* |f) &1*)|r


It is clear from the above analysis that our work outperforms
other state-of-the-art learning algorithms and published tools,
achieving the highest preciseness at an equal level of
conciseness. Furthermore, the comparison indicates that
interleaving can contribute to both preciseness and conciseness.


Table 5: Results of inference using different learning
algorithms/inference tools on www.


Method            |L(r)|         DC         Len  ND
Liquid Studio     1.11 × 10^21   15158.773  120  2
Soa2Sore          .30 × 10^1?    7190.139   165  3
Soa2Chare         .36 × 10^1?    3696.752   120  1
GenEchare         .34 × 10^1?    3685.703   150  2
IntelliJ IDEA     .36 × 10^1?    3696.752   120  1
Trang             .20 × 10^19    3606.698   125  1
InstanceToSchema  .53 × 10^18    3406.824   145  1
learner+DME       .43 × 10^15    1150.850   175  1
conMiner          4.11 × 10^13   0453.822   145  1
GenICHARE         .41 × 10^13    9961.492   170  2
GenESIRE          4.39 × 10^11   8479.873   175  2
iSOIRE            1.84 × 10^11   7599.996   165  1


5 CONCLUSION AND FUTURE WORK 


Based on large-scale real data, we proposed a new subclass,
SOIRE, of regular expressions with interleaving. SOIRE is more
powerful than the existing subclasses and has unrestricted
support for interleaving. Correspondingly, we designed an
inference algorithm, iSOIRE, which learns SOIREs effectively
based on the single occurrence automaton (SOA) and the maximum
independent set (MIS). We conducted a series of experiments
comparing our algorithm with both recent learning algorithms
from academia and industrial tools used in practice. The results
show the practicability of SOIRE and the effectiveness of
iSOIRE, which achieves high preciseness and conciseness.

In future work, we will study another subclass of regular
expressions, k-occurrence regular expressions with interleaving
(k-OIREs), together with its inference algorithm.


ABSTRACT


Following recent technological advances, the quantity of
available videos and their accessibility have exploded. This is
largely explained by falling acquisition costs and the growing
capacity of storage media, which have made it possible to store
large video documents in computer systems. To exploit these
collections effectively, tools are needed that facilitate access
to the documents and their handling. In this context, we propose
a multimedia retrieval approach that puts the user at the center
of the retrieval process, starting from a text query. The new
aspects of our proposal are as follows: (i) for indexing, we
propose a new approach that allows a multi-level, semantic
classification of videos; (ii) for retrieval, a query expansion
mechanism helps the user formulate the query, and a relevance
feedback mechanism improves the results by taking the user’s
feedback into account. Our experimental contribution is the
implementation of the prototype VISEN. The proposed techniques
have been integrated into a content-based search system to
evaluate their contribution in terms of effectiveness and
precision. After a set of tests on 2700 videos and 62838 images,
the experimental results showed that the proposed algorithm
performs well.


1. Introduction


Information retrieval consists of a set of operations that
respond to the user’s information needs via a GUI. First, the
user must build a query. This operation, obvious for text, is
much more difficult for image and video queries. Indeed, the
query can include different kinds of data, such as an image, a
video, a sound, a drawing or an animation. Defining queries is
among the problems raised when searching large databases: a
textual request does not always convey the information need the
user has in mind, although this form of request remains
preferred by users over other forms (images, video, sketches,
etc.) [1]. In what follows, we present several forms of
interrogation and retrieval found in the prototypes described in
the literature. In general, there are two main retrieval
approaches: textual queries and conceptual queries.


In the textual form, queries remain limited and are generally
associated with a specific category of visual documents, such as
TV news. For videos, the most promising way in this context
consists of transcribing the soundtrack to determine its subject
[2][3], rather than exploiting the visual content.


Conceptual queries are the subject of much research. For
example, the INFORMEDIA approach uses a limited set of
high-level concepts to filter the textual query results [4].
This system also creates groups of key-frames [5] and uses the
results of speech recognition to map collections of key-frames
to their associated geographic locations on a map; it combines
this with the other visualizations to give the user an overview
of the context of the query result. The method suggested in [6]
is based on a process of semantic indexing. This system uses a
large semantic lexicon organized in categories and threads to
support the interaction. It also defines a space of visual
similarity, a space of semantic similarity, a semantic thread
space and browsers to exploit these spaces. It is worth
mentioning that the VERGE approach [7] supports the following
functions: (i) high-level visual conceptual retrieval and (ii)
visual retrieval. This tool combines indexing, analysis and
retrieval techniques for diverse modalities (textual, visual and
conceptual).


In recent years, video retrieval based on semantic concepts has
attracted the attention of many researchers [8] [9]. By the
detection of semantic concepts, we mean detecting the presence
or absence of high-level concepts such as bus, forest or sky in
the videos. In [10], the authors combine concept matching, the
query, the corpus and content matching. In [11, 12], the authors
propose a video retrieval system based on viewing and navigating
in concepts. The proposed navigation module is based on a
semantic classification. In [13, 14], the authors propose a
method for the indexing and retrieval of video sequences in a
large video database, based on a weighting technique that
calculates the degree of membership of a concept in the
audio-visual field. In [15, 35], the authors suggest a graphical
approach that lets users dynamically display and explore a query
result space built upon a document repository of interconnected
media objects. [16, 35] introduces an interactive content-based
video browsing system developed for the Video Browser Showdown
2016. The aim of this system is to help the user find specific
video clips in a large collection of videos within limited time
frames.


Although many research efforts have been devoted to the 
detection of concepts, this task remains very difficult [17]. Most of 
the time, the problem is considered as a classification problem in 
which a binary classifier generally learns to predict the presence 
of a certain concept in a video sequence or key-frame based on 
the extracted feature descriptors. 

Traditional content-based video search methods cannot respond to
the user’s needs on their own and have shown their limitations
despite great efforts to improve them [26]. In this context,
Etter [19] improved his video search system by focusing on query
expansion. To reach this goal, he used external data, such as
Wikipedia content (i.e., titles and images). On the other hand,
Elleuch et al. [20] implemented three automatic search
subsystems, which extract text from videos and detect visual and
audio features. In [21], the authors developed a content-based
video retrieval system that extracts color, texture and shape.
First, the texture feature was extracted using multi-fractal
Brownian motion (mBm); then, the color feature was obtained using
a semantic color model; finally, the shape feature was extracted
using the level set method. All these features were then stored
in the indexing format. The IMOTION system [22], in turn, is a
multimodal content-based video search and browsing application
offering a rich set of query modes on the basis of a broad range
of different features. It can scale with the size of the
collection thanks to its underlying flexible polystore, called
ADAMpro, and its retrieval engine, Cineast, for multi-feature
fusion. In [25], the authors presented the semantic video
indexing system of the REGIMVid group, called SVI REGIMVid, a
generic approach for video indexing that provides semantic access
to multimedia archives.

As indicated above, much research has been carried out in the
domain of large-scale audiovisual retrieval and has suggested
tools based on diverse retrieval forms. These techniques have
been formulated for indexing a video by its low-level content.
Most of these studies put forward a very limited and specific
study framework by proposing an approach based on a single
modality. This is justified by the fact that these media are
related to fairly specific fields that are sometimes far apart.
However, this attempt has not yet led to satisfactory results. On
the other hand, semantic content indexing based on a combination
of characteristics from different domains is innovative. Indeed,
since the video components complement one another, combining them
makes the indexing of the video semantically more effective.


Moreover, these techniques have other limitations, which are
often related to the interaction with the user. Enriching the
retrieval techniques with the user’s past behavior therefore
potentially helps provide more pertinent results. In the
literature, few research works have been interested in this
aspect. For instance, the approach suggested in [26] was the
first to put the user at the center of the retrieval process. In
fact, a semi-automatic approach that puts the user at the center
of the retrieval method is a real perspective for improving the
existing methods.


The method suggested in our paper lies in this context, as it
proposes a large-scale video retrieval approach starting from a
textual query. The contributions of our study are (i) an
indexing approach based on DCM (data clustering method), (ii) a
relevance feedback mechanism, which puts the user at the center
of the retrieval process, and (iii) a query expansion mechanism,
which helps reformulate the textual query.


The remaining part of this paper is organized as follows: the
system framework is described in sections 2, 3 and 4, the
experimental setup and the results are presented in section 5,
and finally, in section 6, we conclude with a summary and the
perspectives of our research work.


2. General Description Of Proposed System 


Fig.1 presents the global architecture of the suggested system 
VISEN (VIdeo System Engine). Like any information retrieval 
system, our VISEN system is decomposed according to the 
following functional phases: 


Indexing Phase: Indexing is intended to extract and 
represent the meaning of a document so that it can be 
retrieved by the user. To reach this goal, we have suggested a 
classification approach called “DCM, Data Clustering Method” 
detailed in section 3. 


Retrieval Phase: This phase exploits the result of the
indexation phase. It includes some sub-phases, such as (i) a
Query expansion phase, which guides the user to rephrase his
query in terms of concepts. The innovation of our method is
the way the concepts are built from the entered keywords. To
do this, we use the ontology and a descriptor vector that will
be explained later. (ii) a Relevance Feedback phase, which
consists in the interaction with the user when the results are
displayed. Indeed, several research works have proposed
relevance feedback in the context of image and text
information retrieval, but rarely for video.
Inspired by the relevance feedback method initially proposed
by [27], we propose a new method that can be applied to video
documents based on concepts.


[Figure 1 diagram labels: Audio-visual Corpus; Indexing Phase; User; Textual Request; Query Reformulation]


Figure 1: Conceptual Architecture of VISEN System 
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3. DCM An Efficient Data Clustering Method 


For relevant indexing, we propose to organize the indexes by
levels. The organization follows three levels (contextual,
conceptual, raw data) to facilitate its use during the
retrieval phase. The idea is to associate the data and the common
features that have a semantic relationship. In other words, videos
with a common concept will be assigned to the same group.
Moving to a higher level of abstraction helps us organize the data
and makes it easier to access. It is possible to semantically
group similar concepts under the same context. Indeed, the
navigation process can be done at the three following abstraction
levels:

Level 0: Contains all the audio-visual documents related to the 
different subjects; 

Level 1: Contains concepts, like an Actor, a Boy, a Girl, a Face...;

Level 2: Contains the contexts representing the most relevant
concepts in the corpus, like a person, an animal, a vehicle.
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Figure 2: DCM Process 


3.1 Automatic Detection Of Concepts 


For the concept detection phase, we used the results of our
semantic indexing system "SVI_LAMIRA"¹, which is based on
our low-level descriptor "PMC" [28], on the one hand, and on our
descriptor "PMGA" [28], on the other hand.


The result of this step enables us to move from Level 0 (Data) to 
Level 1 (Conceptual) as shown in figure 3. 


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Adult Reporters Face Hospital 


Figure 3: Automatic detection of concepts result


3.2 Weighting Of Concepts


In the previous section, the concepts have been organized and
linked to the video shots that they index. However, this
association is not weighted. In this step, we add a
weighting to these concepts, on the one hand, and detect a
context level to better structure and guide the navigation of the
user, on the other hand.

Concepts weighting: To define the weights of the concepts,
each video is assigned a weight for each concept, which
measures the importance of that concept in the video.
Each video is therefore indexed by a set of concepts. For
concept weighting, we adopt the TF-IDF (term frequency-inverse
document frequency) measure. This measure is the
most used in IR work because it puts into perspective the
importance of a concept in a document. For this reason, we
propose a new measure that combines the local and global
weights, where the former depends only on a given video and
the latter depends on the whole corpus. The weight of a
concept Ci for a video Vk is given by the
following equation:


$$W_{C_i,V_k} = TF(C_i,V_k) \cdot IDF(C_i,V_k) = \frac{Nbs(C_i,V_k)}{n} \cdot \frac{Nbcs(C_i)}{N \cdot Nbc(V_k)} \qquad (1)$$


Nbs(C_i, V_k) is the number of shots containing Ci in Vk. n is
the total number of shots in Vk. N is the number of identified
concepts in Vk. Nbc(V_k) is the number of videos containing
Ci. Nbcs(C_i) is the number of concepts similar to Ci.
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Figure 4: Weighting of concepts 
3.3 Detection Of Contexts 


Moving to another level of abstraction helps us organize
the data and speed up access. We have grouped the most
semantically similar concepts under the same context by
proposing a method to extract the context from the concepts.
Our notion of context is inspired by the studies presented in
[28], which are based on the construction of a knowledge
coding technology called topic map. We adapted this
technique to the audio-visual domain by proposing
semantic entities called "contexts".

We define a context as a concept that satisfies two criteria:

It appears most frequently in the audio-visual collection.
The technique explained in section 3.2 is
adopted to determine the frequency of this appearance.
Therefore, the following equation is used:


$$E_1(i) = \frac{1}{N} \sum_{k} W_{C_i,V_k}, \quad i \in \{\text{concept } 1, \ldots, \text{concept } 130\} \qquad (2)$$


where E1 is the sum of the weights of concept Ci in all the
Vk videos of the collection. N is the total number of
concepts.


Its total similarity with all the other concepts is
the highest. Therefore, the following equation is
used:


$$E_2(i) = \sum_{j} Sim(C_i, C_j) \qquad (3)$$


¹ In collaboration with the MIRACL laboratory of the University of Sfax,
Tunisia.
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Then, E2 is the sum of similarity values of Ci with all 
the concepts Cj of the ontology and N is the total 
number of concepts. The semantic relationship 
between the concepts will be detailed in section 3.2.3. 


The combination of two weights gives us the following 
equation (2) * (3) 


$$E(i) = \arg\max_i \big(E_1(i) \cdot E_2(i)\big) = \arg\max_i \left( \sum_{j} Sim(C_i, C_j) \cdot \frac{\sum_{k} W_{C_i,V_k}}{N} \right) \qquad (4)$$


The first term represents the relationship between all the
concepts. The second term represents the frequency of the
concepts in the corpus, where E is the set of the selected contexts,
n is the total number of concepts and N is the number of
videos containing Ci.


This enables us to sort out a semantic summary (set of contexts) 
by indicating the importance of each concept in the collection. In 
addition, it provides a relevant starting point to help the user to 
begin the navigation process. 
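
The context selection described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the names `context_scores` and `top_contexts`, the toy weights and similarities, the normalisation of E1 by the number of concepts, and the choice to keep the k best-scoring concepts as contexts are all hypothetical and are not taken from the VISEN implementation.

```python
from typing import Dict, List, Tuple


def context_scores(weights: Dict[str, Dict[str, float]],
                   sim: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Score each concept by E1(i) * E2(i), in the spirit of equations (2)-(4)."""
    concepts = sorted(weights)
    n_concepts = len(concepts)
    scores = {}
    for ci in concepts:
        # E1: summed weight of the concept over all videos, divided by the
        # number of concepts (assumed normalisation).
        e1 = sum(weights[ci].values()) / n_concepts
        # E2: total similarity of the concept with every other concept.
        # The similarity table is assumed symmetric and stored once per pair.
        e2 = sum(sim.get((ci, cj), sim.get((cj, ci), 0.0))
                 for cj in concepts if cj != ci)
        scores[ci] = e1 * e2
    return scores


def top_contexts(scores: Dict[str, float], k: int = 1) -> List[str]:
    """Return the k concepts with the highest combined score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (hypothetical): weights[concept][video] and pairwise similarities.
weights = {
    "Person": {"v1": 0.6, "v2": 0.5, "v3": 0.4},
    "Actor": {"v1": 0.3},
    "Car": {"v3": 0.2},
}
sim = {("Person", "Actor"): 0.8, ("Person", "Car"): 0.1, ("Actor", "Car"): 0.05}

scores = context_scores(weights, sim)
print(top_contexts(scores, k=1))
```

In this toy corpus, "Person" is both frequent and similar to the other concepts, so it is the natural candidate for a context.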

Figure 5 shows a global view of context detection process. 
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Figure 5: Context detection process 


3.4 Inter-Concept Similarity 


The semantic similarity distance used here is inspired by the
Rada distance, which is based on the distance between two
concepts in the ontology [29]. It is used to define a new measure
of similarity between the concepts. Moreover, this new measure
considers the amount of information shared between the two
concepts in the corpus (the larger the amount of shared
information, the more similar the concepts are). The
definition of this measure is intended to serve as a transition
between the contextual and conceptual levels. The similarity
measure is therefore defined as:


$$Sim(C_i, C_j) = \frac{Card(\{C_i\} \cap \{C_j\})}{Card(\{C_i\} \cup \{C_j\})} \cdot \frac{1}{dist(C_i, C_j)} \qquad (5)$$


The first term represents the probability to have both concepts
when either of them is present. The second term shows that the
similarity is reduced when the two concepts are far from each
other.


Sim(C_i, C_j) is the similarity between concepts Ci and Cj.

{C_i} ∩ {C_j} is the set of video shots in the whole corpus indexed
with Ci and Cj.

{C_i} ∪ {C_j} is the set of video shots in the whole corpus indexed
by Ci or Cj.

dist(C_i, C_j) is the distance between Ci and Cj, which is equal
to the number of links separating the two concepts in the
ontology.
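
As a rough illustration of this measure, the following Python sketch combines the shot-overlap ratio with the number of ontology links between two concepts. The function names, the toy ontology, the shot identifiers, and the convention that a concept has similarity 1 with itself are assumptions made for the example; the ontology distance is computed here with a plain breadth-first search, which is one possible way to count links.

```python
from collections import deque
from typing import Dict, List, Set


def ontology_distance(graph: Dict[str, List[str]], a: str, b: str) -> int:
    """Number of links on the shortest path between two concepts (BFS)."""
    if a == b:
        return 0
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, depth = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour == b:
                return depth + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    raise ValueError(f"no path between {a!r} and {b!r}")


def concept_similarity(shots: Dict[str, Set[str]],
                       graph: Dict[str, List[str]],
                       ci: str, cj: str) -> float:
    """Shot-overlap ratio divided by the ontology distance, as in equation (5)."""
    either = shots[ci] | shots[cj]
    if not either:
        return 0.0
    dist = ontology_distance(graph, ci, cj)
    if dist == 0:
        return 1.0  # assumption: a concept is fully similar to itself
    both = shots[ci] & shots[cj]
    return (len(both) / len(either)) / dist


# Toy ontology (undirected, stored as adjacency lists) and shot index.
graph = {"Person": ["Actor", "Adult"], "Actor": ["Person"], "Adult": ["Person"]}
shots = {"Actor": {"s1", "s2", "s3"}, "Adult": {"s2", "s3", "s4"}, "Person": set()}

print(concept_similarity(shots, graph, "Actor", "Adult"))
```

Here the two concepts share 2 of 4 shots and are 2 links apart, so the overlap ratio of 0.5 is halved by the distance.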


3.5 Similarity Between Videos


We propose to classify the videos according to their semantic
similarity: the more concepts the videos have in common, with
small differences between the weight values, the closer their
content is semantically. This is based on the Euclidean distance
defined in the following equation.


$$D(V_i, V_j) = \sqrt{\sum_{k=1}^{N} (W_{ki} - W_{kj})^2} \qquad (6)$$


N: number of common concepts between Vi and Vj.
W_{ki}: weight of the concept Ck in the video Vi.
W_{kj}: weight of the concept Ck in the video Vj.
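
A minimal Python sketch of this distance is given below. The function name `video_distance`, the toy weight vectors, and the decision to return infinity when two videos share no concept are assumptions for illustration, not details given in the paper.

```python
import math
from typing import Dict


def video_distance(wi: Dict[str, float], wj: Dict[str, float]) -> float:
    """Euclidean distance over the concepts that both videos have in common."""
    common = wi.keys() & wj.keys()
    if not common:
        # Assumption: videos without any shared concept are treated as
        # maximally distant instead of getting a misleading distance of 0.
        return math.inf
    return math.sqrt(sum((wi[c] - wj[c]) ** 2 for c in common))


# Toy concept weights for two videos (hypothetical values).
v1 = {"Actor": 0.5, "Face": 0.9, "Car": 0.1}
v2 = {"Actor": 0.2, "Face": 0.5}

print(video_distance(v1, v2))
```

Only "Actor" and "Face" are shared, so "Car" does not contribute to the distance.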


3.6 Inter-Shot Similarity


We adopt an approach that combines the measurement between
shots based on the arcs [13, 30] and the measurement of the
informational content [31, 32]. For the informational content
measurement, we use the cosine similarity, while the
measurement of the distance between the concepts is inspired
by [13]. Through this combination, we can take advantage of
both approaches at once. The equation is then defined as follows:


$$Sim(sh_1, sh_2) = Sim(c_n, c_m) = \frac{\sum_{j=1}^{n} Pc_j(sh_1) \cdot Pc_j(sh_2)}{\sqrt{\sum_{j=1}^{n} Pc_j(sh_1)^2} \cdot \sqrt{\sum_{j=1}^{n} Pc_j(sh_2)^2}} + \frac{1}{\sum_{i} \sum_{j} dist(c_i(sh_1), c_j(sh_2))} \qquad (7)$$


where c_i(sh_1): the i-th concept in shot 1.

Pc_j(sh_1): the weight of concept j in shot 1.

dist(c_i(sh_1), c_j(sh_2)): the distance between concept Ci
of shot 1 and concept Cj of shot 2, which is the number of arcs
separating both concepts.


In the end, as shown in figure 6, we proceed to generate a 
hierarchical structure made up of three levels: the first level 
represents the contexts; the second represents the concepts 
while the third represents the data. 


[Figure 6 diagram: hierarchy with a Contextual Level, a Conceptual Level and a Data Level; visible labels include Indoor, Outdoor and Vegetation]


Figure 6: organization result 


4. Search System Framework 


Fig.7 shows the overall architecture of the proposed video 
search system. It breaks down according to the following 
functional phases: 
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Figure 7: Conceptual Architecture of Search System “VISEN” 


1. Indexing phase (section 3) 

2. Indexation of the query: equiprobable weighting between 
terms. 

3. Query expansion: Generally, the most intuitive way for a user
is to express a text query with keywords that define the content
of the videos he is looking for. In this context, relying only on
keywords for retrieval in Arabic, French and English is
considered insufficient because the keywords used in the query
can differ from the documents in the database on several
levels, for example:

- Morphological variations 

- Lexical variations (different words are used for the same 
meaning) 

- Semantic variations 

Subsequently, the system analyzes the text query and translates
it into concepts. The problem at this level is how to transform a
textual query into concepts. To solve this problem, we propose to
use a query expansion technique based on a domain ontology
to help the user formulate his query. Using an ontology to
enrich (expand) the user’s query can be one solution (among
others) to the problem of semantic variations. Indeed, an
ontology offers resources in the form of semantic relations,
which improves the search results. Furthermore, a
morphological analyzer can be sufficient to resolve the
morphological and lexical variations.

It should be noted, however, that the pertinence of the results
obtained by an IR system does not depend only on the matching
process (between query and documents), but also on the
pertinence of the query. Hence, it is necessary to reformulate
the query. An IR system can do so using two approaches, one
direct and the other indirect.


To assist the user in improving his query, we use a
morphological and lexical analyzer (in our case a descriptor
vector) and a semantic resource (in our case WordNet and our
own DCM ontology) for direct reformulation.


4. Description of DCM ontology concepts: The same enrichment 
method was used (Vector descriptor and WordNet) 

5. Matching: the query term description (after enrichment) is
compared with the terms that are assigned to describe each
concept of our ontology, and the concepts that have similar
descriptors are returned by using the Jaccard distance:


$$Jaccard(D_{rq}, D_c) = \frac{|D_{rq} \cap D_c|}{|D_{rq} \cup D_c|} \qquad (8)$$

where
D_rq are the terms that are assigned to describe those of the
query.

D_c are the terms that are assigned to describe each concept of
the ontology.


IDEAS2019 - 23nd International Database Engineering & Applications Symposium

6. Concept refining: We project onto the DCM ontology to
refine the result by using equation 5 of section 3.4. Starting
from the concepts identified in the previous step, we identify
other concepts that have semantic relations with them and
that may be important for the user.

7. The user’s choice: The user intervenes to manually choose 
the concepts matching his needs. 

8. Concept comparison: These concepts are matched with those of
the whole collection following a vector space model:


$$Sim(req, V_i) = \cos(req, V_i) = \frac{\sum_{j=1}^{n} Pc_j(V_i) \cdot Pc_j(req)}{\sqrt{\sum_{j=1}^{n} Pc_j(V_i)^2} \cdot \sqrt{\sum_{j=1}^{n} Pc_j(req)^2}} \qquad (9)$$

where:

V_i: video i.

req: request.

Pc_j(V_i): the weight of concept j in video i.

Pc_j(req): the weight of concept j in the request.

9. Retrieval results. 

10. Relevance Feedback Method: The relevance feedback
method presented in this section is based on Rocchio's model
[20] to rephrase the query. This model can be summarized as
follows: a retrieval session starts from an initial query Q0,
which is then modified according to the results returned by
the system. Q1 is the resulting modified query, which is
closer to the user's optimal query. The efficiency of this
process depends on the quality of the initial query and on the
level of convergence of the successive iterations towards an
optimal query. Applying Rocchio’s formula to our context
gives:


$$Q_1 = Q_0 + \alpha \sum_{i=1}^{n_1} \frac{P_i}{n_1} - \beta \sum_{j=1}^{n_2} \frac{NP_j}{n_2} \qquad (10)$$


where

Q1 is the vector of the new query

Q0 is the vector of the initial query

P_i is the vector of concepts matching the pertinent videos that
are returned and assessed

NP_j is the vector of concepts matching the non-pertinent videos
that are returned and assessed

n1 is the number of returned and assessed pertinent concepts

n2 is the number of returned and assessed non-pertinent
concepts

α is a positive parameter weighting the concepts of videos
judged to be pertinent

β is a positive parameter weighting the concepts of videos
judged to be non-pertinent

Illustration of relevance feedback process 
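
Two of the steps listed above, descriptor matching with the Jaccard measure (step 5) and ranking by cosine similarity (step 8), can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the threshold of 0.2, the descriptor terms, and the concept weights are invented for the example and do not come from VISEN.

```python
import math
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_concepts(query_terms: Set[str],
                   concept_terms: Dict[str, Set[str]],
                   threshold: float = 0.2) -> List[str]:
    """Return concepts whose descriptors overlap enough with the query terms."""
    scored = {c: jaccard(query_terms, terms) for c, terms in concept_terms.items()}
    kept = [c for c, s in scored.items() if s >= threshold]
    return sorted(kept, key=lambda c: scored[c], reverse=True)


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse concept-weight vectors."""
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_videos(request: Dict[str, float],
                videos: Dict[str, Dict[str, float]]) -> List[str]:
    """Order video identifiers by decreasing cosine similarity to the request."""
    return sorted(videos, key=lambda vid: cosine(request, videos[vid]), reverse=True)


# Hypothetical enriched query description and ontology concept descriptors.
query_terms = {"person", "human", "individual"}
concept_terms = {
    "Actor": {"person", "performer", "human"},
    "Adult": {"person", "grown", "human", "individual"},
    "Car": {"vehicle", "wheel"},
}
candidates = match_concepts(query_terms, concept_terms)
print(candidates)

# The user picks "Actor"; the request gives it a weight of 1.
request = {"Actor": 1.0}
videos = {
    "v1": {"Actor": 0.8, "Face": 0.6},
    "v2": {"Car": 0.9},
    "v3": {"Actor": 0.3, "Car": 0.4},
}
print(rank_videos(request, videos))
```

Sparse dictionaries are used for the vectors so that concepts absent from a video simply contribute nothing to the dot product.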


5. User interface 


To clearly illustrate the features offered by our VISEN system, 
we present the following scenario: 

Let us suppose that the user types "Person" as a query. VISEN will
search for the concepts most pertinent to this query. In order to
refine the query, the description of the ontology concepts is
matched with the query term using equation 8 (initial concept
selection) and equation 5 (refinement), presented respectively
in sections 4 and 3. For instance, the returned result is the
following concepts: Actor, Adult, etc.

Let us also suppose that the user selects the concept ‘Actor’
among the concepts presented by VISEN. VISEN then applies
equation 9 presented in section 4 in order to present the video
search result. Fig.8 presents the search results ordered by level of
pertinence.


[Figure 8 screenshot: "VISEN: Video Search Engine" window with File and Help menus and a Query field]


Figure 8: Result of the textual query 


The VISEN retrieval interface supports a relevance feedback
mechanism. If the user is dissatisfied with the result, he can
mark the results that are pertinent or non-pertinent for a given
query. The system then recalculates the weights of the concepts
by taking into consideration the user’s indications and applying
equation 10 presented in section 4.
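
The recalculation step can be sketched in Python as a Rocchio-style update over concept-weight vectors. The function name `rocchio`, the default values of `alpha` and `beta`, and the toy vectors are assumptions chosen for illustration; the paper does not state which parameter values VISEN uses.

```python
from typing import Dict, List


def rocchio(q0: Dict[str, float],
            relevant: List[Dict[str, float]],
            non_relevant: List[Dict[str, float]],
            alpha: float = 1.0,
            beta: float = 0.5) -> Dict[str, float]:
    """Move the query towards the centroid of the videos judged pertinent
    and away from the centroid of the videos judged non-pertinent."""
    concepts = set(q0)
    for vec in relevant + non_relevant:
        concepts |= set(vec)
    q1 = {}
    for c in sorted(concepts):
        pos = sum(v.get(c, 0.0) for v in relevant) / len(relevant) if relevant else 0.0
        neg = (sum(v.get(c, 0.0) for v in non_relevant) / len(non_relevant)
               if non_relevant else 0.0)
        q1[c] = q0.get(c, 0.0) + alpha * pos - beta * neg
    return q1


# Toy feedback round (hypothetical values): two videos marked pertinent,
# one marked non-pertinent.
q0 = {"Actor": 1.0}
relevant = [{"Actor": 0.8, "Face": 0.6}, {"Actor": 0.6, "Face": 0.2}]
non_relevant = [{"Car": 0.9}]

q1 = rocchio(q0, relevant, non_relevant)
print(q1)
```

After this round the query gains weight on "Face", which co-occurs with the pertinent videos, and gets a negative weight on "Car".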

By a simple click on the shot itself, the user can access the
video relating to this shot. Images similar to it are
displayed by using equations 6 and 7 of section 3.


Figure 9: Audio-visual document player 


6. Experimentation 


6.1 Data Sets 


For the TRECVID 2015 Semantic Indexing task, two data
sets are provided by the National Institute of Standards and
Technology (NIST): a test and a development data set. The
development data set IACC.2.tv15 contains 3200 Internet
Archive videos (50GB, 200 h) while the test data set IACC.2.
contains approximately 8000 Internet Archive videos (50GB,
200 h). IACC 2. is annotated with 130 semantic concepts.

The experimentation involves three main steps, which are: an 
experimentation of our VISEN prototype, an experimentation 
of the indexing (concept weighting pertinence) and an 
experimentation of the retrieval phase. 


6.1 DCM Method 


To better explain the results of the automatic detection of
concepts, we present them in the histogram of Fig.10, which
shows the weighting level of the most pertinent concepts.
These weighting levels vary according to the systems. We have
compared our proposed method to other proposed works. It is
clear that the global concepts are the most pertinent ones
(contexts) as they help cover the whole corpus.
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Figure 10: Comparison of the average precision of the [23],[24],[25] tools and proposed SVI_VISEN in the TRECVID 2015 


6.2 Retrieval Phase 


6.2.1 Comparison with other systems 


In the second step of the assessment, we compare our system
with the most pertinent semantic retrieval systems. The
following figure (Fig.11) shows some query outcomes.
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Figure 11: Precision value 


Based on the histogram (Fig.11), we note that the accuracy values
corresponding to the sports and vegetation concepts are equal to
1. Compared to the accuracy values of the works
proposed by [13] and by [28], we can see a
significant improvement.
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We observe that all the accuracy values exceed 0.85, which 
means that the improvement encompasses all the concepts. It 
is broadly clear that the suggested Interactive Search 
technique improves the system performance. 

Fig.12 presents the comparison of interactive video search
results for 24 topics performed by 36 users of the present-day
video retrieval systems. The VISEN results are indicated with
special markers.


[Figure 12 plot: Interactive Search Results for the following topics]

1. one or more emergency vehicles in motion
2. tall buildings and the top story visible
3. people leaving or entering a vehicle
4. soldiers escorting a prisoner
5. a daytime demonstration with one building visible
6. US Vice President Dick Cheney
7. Saddam Hussein with another persons face visible
8. multiple people in uniform and in formation
9. US President George W. Bush, Jr. walking
10. soldiers with weapons and military vehicles
11. water with boats or ships
12. people seated at a computer, display visible
13. people reading a newspaper
14. a natural scene; no buildings, roads, or vehicles
15. one or more helicopters in flight
16. something burning with flames visible
17. people dressed in suits, seated, and with flag
18. a person and at least 10 books
22. Condoleeza Rice
23. soccer goalposts
24. scenes with snow

Figure 12: Comparison of VISEN


6.2.2. The user’s analysis


Our proposed survey for the evaluation of our VISEN system is
inspired by [34].

A user study with 25 participants was performed to evaluate our
VISEN system. The participants taking part in the experiments are
computer science university students who are familiar
with video search engines like “YouTube” and “Dailymotion”.
After using our system, the participants must fill in a form
containing the following questions.
The ratings range from 1 to 7: 1 means ‘strongly
disagree’ and 7 means ‘strongly agree’.


[Figure 13 form: each statement is rated on a scale from "strongly disagree" to "strongly agree"; legible statements are listed below]

1. Overall, I am satisfied with how easy it is to use this system.
3. I can effectively complete my work using this system.
4. I am able to complete my work quickly using this system.
5. I am able to efficiently complete my work using this system.
6. I feel comfortable using this system.
7. It was easy to learn to use this system.
8. I believe I became productive quickly using this system.
9. The system gives error messages that clearly tell me how to fix problems.
10. Whenever I make a mistake using the system, I recover easily and quickly.
11. The information provided with this system is clear.
12. It is easy to find the information I needed.
13. The information provided for the system is easy to understand.
14. The information is effective in helping me complete the tasks and scenarios.
15. The organization of information on the system screens is clear.

Figure 13: The Computer System Usability
Questionnaire “CSUQ”


As it emerges from figure 13, the first eight questions assess
the system usefulness. Questions 9-15 assess the participants’
satisfaction with the quality of the information associated
with the system. Questions 16-18 provide a rating for the
interface quality.

Fig.14 summarizes the CSUQ questionnaire participants'
scores. Subsequently, we provide a graphical representation
(in the form of a histogram) of these answers. An
interpretation of this test finding is included at the end of this
section. It should be recalled that we used a 1-7 scale with 7
being the best possible score and 1 the worst.


[Figure 14 histogram legend: [15], [28], [35], VISEN]


Figure 14: Comparison between the average score of [15], [28], [35] and our VISEN system for 25 participants
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The histogram (Fig.14) shows that the maximum votes were 
concentrated on the middle scores between 5 and 7. Based on the 
result of this experiment, it can be concluded that using our 
system does not entail any major drawback. Nevertheless, there 
is a need to improve some points, such as concepts similarity. 


In the second step of the assessment, we compare our system with
some semantic retrieval systems (Fig.15). With scores of 85.85%
for overall user satisfaction, 78.14% for system usefulness,
81.42% for information quality and 82.85% for interface quality,
our VISEN scheme has the highest score among the compared
state-of-the-art systems. In fact, our proposed VISEN
system outperforms those of Ben Hallima and Hamroun [13], U.
Rashid [15] and M. Hamroun [28, 36].


Having analyzed the obtained test results, we can affirm that our
video search system has proven reliable and efficient. These tests
enabled us to assess the performance of the formula used for
concept weighting as well as the formula used for inter-concept
similarity calculation. We may conclude that our system
achieved our goal to a certain extent.


[Figure 15 histogram legend: [15], [28], [35], VISEN]


Figure 15: Comparison between the score value of [15], 
[28], [35] and our VISEN system 


7. Conclusion


The general framework of our work, in this paper, is the
automatic indexing of video based on its semantic content. In
fact, we have proposed a semantic indexing model. Our
contribution in this framework is based on several approaches: (i)
the DCM classification method for better indexing of content, and
(ii) new query expansion and relevance feedback methods putting
the user at the centre of the retrieval process. We implemented a
prototype entitled "VISEN" to validate our video semantic
indexing models. The developed prototype enables users to easily
access the desired video. To test our VISEN (video retrieval
system) prototype, we used the audio-visual corpus
(TRECVID2015), which is characterized by its size and the
importance of its heterogeneous content.

Our aim, as a first perspective, is to merge low-level and
high-level descriptors in the retrieval process, so that the
user can indicate the extent to which a retrieval must be based
on visual and/or semantic features. The second part of our
perspective is about considering special relationships between
concepts. Indeed, for the moment, our automatic system detects
the concepts without considering any relation between them. It
will be interesting to create special relationships between the
concepts or objects, as in Belz et al.’s works [33]. For example,
"Singing" and "walking" are concepts of human actions. The
concept of "cycling" is defined in TRECVID as "a person riding a
bicycle". Although both a "bicycle" and a "person" exist in
Fig.16, this image does not fit "Bicycling" because the person is
not riding the bicycle. These indicators are important for some
concepts, which indicates that not only the detected concepts are
important but also their special relationship.


Figure 16: Example of VISEN error 
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ABSTRACT 


This paper describes a data mining study of a set of ancient scripts 
in order to discover their relationships, including their possible 
common origin from a single root script. The data mining uses 
convolutional neural networks and support vector machines to 
find the degree of visual similarity between pairs of symbols in 
eight different ancient scripts. Among the surprising results of 
the data mining are the following: (1) the Indus Valley Script is 
visually closest to Sumerian pictographs, and (2) the Linear B script 
is visually closest to the Cretan Hieroglyphic script. 
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1 INTRODUCTION 


The data mining work in this paper is motivated to help decipher 
ancient scripts such as the still undeciphered Indus Valley Script 
[24]. The idea is that if an undeciphered script can be matched 
with an already deciphered script, then the phonetic values of the 
symbols in the deciphered script can be reasonably expected to 
match the phonetic values of the corresponding symbols in the 
undeciphered script. 

We applied various data mining methods to compare and analyze
the relationship among the following ancient scripts: Brahmi,
Cretan Hieroglyphs, Greek, Indus Valley, Linear B, Phoenician,
Proto-Elamite, and Sumerian Pictographs. Our data mining yields
a script family tree with a common origin of all these scripts. A
particularly interesting finding of our data mining is that the Indus
Valley Script seems to derive from the Sumerian Pictographs. Our
finding is supported by the following observations of other authors.
First, it is known that intensive trade existed, mainly by sea, between
the ancient civilizations in Mesopotamia and the Indus valley, and
the urbanization, irrigation technology, social organization,
commercial patterns, and numerous other features of the Indus Valley
civilization bear a close resemblance to the Sumerian model [4, 9].
Second, the ancient Sumerian records referred to the Indus Valley
Civilization as Meluhha, which means “high country” in Dravidian
languages according to Parpola [24] and may be related to the
present day region of Baluchistan.

The rest of this paper is organized as follows. Section 2 describes 
the dataset of the ancient scripts and texts which we used as a data 
source. Section 3 describes the machine learning methodologies that 
we used for the computerized comparison of the visual characteris- 
tics of pairs of symbols from the different scripts. Section 4 presents 
the experiments and results and analyzes the findings. Section 5 
discusses related work. Finally, Section 6 gives some conclusions 
and directions for further research. 


2 DATASET 


In this section, we provide the historical background for all the 
scripts used in this work. We also describe how the datasets were 
created for the computations. 


2.1 Brief Review of the Eight Scripts Considered


2.1.1 Brahmi. Brahmi is the second oldest South Asian script, after
the Indus Valley Script. The Brahmi script is an abugida, which uses
a system of diacritical marks to denote vowel association with the
consonant symbols. The direction of writing for the Brahmi script
is left to right. Much like the Indus Valley Script, the Brahmi script
has a debated origin.

Figure 1: Sample Brahmi script symbols. 


2.1.2 Cretan Hieroglyphs. Cretan Hieroglyphs was the first writing
of the Minoans and the predecessor to Linear A, which in turn gave
rise to Linear B and Cypriot. It was used between 2100 and 1700 BC
[2, 23]. The second author recently proposed a decipherment of
Cretan Hieroglyphs [36], but there are many alternative proposals.
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evih2od mw || n a 
Figure 2: Sample Cretan Hieroglyphs. 


2.1.3 Greek. There were many variants of the early Greek alphabet,
each suited to a local dialect. Eventually, the Ionian alphabet
was adopted in all Greek-speaking states. Ancient Greek is a full
(consonants and vowels) alphabet. Greek was written from around
800 BC to the 5th century in both a right-to-left and a
boustrophedonic style, but later it transitioned to a left-to-right
writing system [5].




Figure 3: The 26 letters of the ancient Greek alphabet. 


2.1.4 Indus Valley. The Indus Valley Script is an undeciphered 
script, which was used between 2400 and 1900 BCE [25]. It is stated 
to be a logographic and syllabic writing system, written from right 




Figure 4: Sample Indus Valley script symbols. 


2.1.5 Linear B. Linear B was used in Mycenaean Greece and is
the oldest known Greek writing [15]. Linear B remained a mystery
until 1952, when Michael Ventris deciphered it, showing
that it is an archaic version of Greek [3]. Linear B is a syllabic
writing system where in general each syllable begins with a single
consonant, which is followed by a single vowel.




Figure 5: Sample Linear B symbols. 


2.1.6 Phoenician. The Phoenician alphabet was used from 1200 to
150 BC in the eastern Mediterranean [13]. The Phoenician alphabet
is an abjad (only consonants with no vowels) writing system,
written from right to left, which consists of 22 letters
representing consonants [13]. The Phoenician alphabet may have
derived from Egyptian Hieroglyphs [16] or Linear B [35].
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«s3"1d4a4YrBeads 


Figure 6: The 22 letters of the Phoenician alphabet. 


2.1.7 Proto-Elamite. The Proto-Elamite script was briefly used
from the end of 4000 to the beginning of 3000 BCE in present-day
Iran and southern Iraq [11]. The script uses around 1900
non-numerical signs, although 1700 of those signs only appear a
maximum of nine times in the 1600 Proto-Elamite texts [8]. The
Proto-Elamite script is said to be logographic or ideographic [11]
and is also considered undeciphered.




Figure 7: Sample Proto-Elamite script symbols. 


2.1.8 Sumerian Pictographs. The Sumerian language is distantly re- 
lated to both the Uralic and the Dravidian language families [28, 41]. 
However, the Sumerian Pictographs are considered an independent 
development by most researchers [11]. The Sumerian pictographic 
script is primarily a syllabic and logographic writing system. It was
written from left to right, and it and its cuneiform descendant were
used from 3100 BCE to the 1st century AD [11].




Figure 8: Sample Sumerian Pictographs. 


2.2 Data Source 


The eight different scripts outlined in the previous section were 
used as a data source. For the Brahmi script we use 34 of the symbols 
(Figure 1), for the Cretan Hieroglyphs we use 22 symbols (Figure 
2), for Greek we use all 27 symbols (Figure 3). For the Indus Valley
Script, we use the 23 symbols (Figure 4) with the highest frequencies,
because the Indus Valley Script has over 400 symbols, and symbols
that occur only once or twice are likely
to be insignificant [44]. For Linear B we use 20 symbols (Figure 5),
for the Phoenician alphabet we use all 22 symbols (Figure 6), for 
the Proto-Elamite script we use 17 symbols (Figure 7), and for the 
Sumerian Pictographs we use 34 symbols (Figure 8). 
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2.3 Data Gathering and Processing 


Our dataset is modeled after the MNIST image database [21]. Each
symbol in our dataset has 780 training images and 120 validation
images, for a total of 900 images per symbol. The images were
hand-generated and then modified by computer
with minor skewing and distortion. Each image is 50x50
pixels, grayscale, and centered in the 50x50 region using its center
of mass. These are the preprocessing steps required for
each dataset.
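
As an illustration of the centering step described above, the following sketch shifts an image so that its intensity-weighted center of mass lands at the middle of the 50x50 canvas. It is a minimal sketch in plain Python and not the authors' preprocessing code; the function name `center_by_mass` and the list-of-rows image representation (with higher values meaning ink) are assumptions made here for illustration.

```python
def center_by_mass(image, size=50):
    """Shift a grayscale image so its intensity centroid sits at the canvas center.

    `image` is a list of rows of non-negative pixel intensities.
    Pixels shifted outside the size x size canvas are dropped.
    """
    total = sum(sum(row) for row in image)
    if total == 0:
        return [[0] * size for _ in range(size)]  # blank image: nothing to center

    # Intensity-weighted mean row (cy) and column (cx): the center of mass.
    cy = sum(y * sum(row) for y, row in enumerate(image)) / total
    cx = sum(x * v for row in image for x, v in enumerate(row)) / total

    # Whole-pixel shift that moves the centroid to the canvas midpoint.
    dy = round((size - 1) / 2 - cy)
    dx = round((size - 1) / 2 - cx)

    out = [[0] * size for _ in range(size)]
    for y, row in enumerate(image):
        for x, v in enumerate(row):
            ny, nx = y + dy, x + dx
            if v and 0 <= ny < size and 0 <= nx < size:
                out[ny][nx] = v
    return out


# A 2x2 blob in the top-left corner ends up in the middle of the canvas.
img = [[0] * 50 for _ in range(50)]
for y in (0, 1):
    for x in (0, 1):
        img[y][x] = 255
centered = center_by_mass(img)
```

The shift is rounded to whole pixels, so the centroid can end up within half a pixel of the exact center; a sub-pixel variant would need interpolation.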


3 SOFTWARE ARCHITECTURE 
3.1 Convolutional Neural Network (CNN) 


We created neural networks using Python and TensorFlow with a 
Keras wrapper. The constructed neural networks have various levels 
of accuracy, depending on the script learned. The architecture of 
our convolutional neural network is similar to the LeNet model [20] 
with a modification of the output classification, as shown in Figure
9. The main deviation from the original LeNet model is that we
use an SVM classifier for the final dense layer instead of a Softmax
layer. Previous work has shown this to be useful in the recognition
of other languages [10], and even when the sample set is larger than
ten [21].

Starting with our image size of 50x50, we first apply a convolution
using a filter size of 5x5, which reduces our image to 46x46. After
this we apply a pooling layer which halves our image size,
yielding a 23x23 image. We then add one more convolution layer
using a 4x4 filter, which reduces our image size to 20x20. Then we
apply a pooling layer which halves our image again, to 10x10.
We then flatten the result and pass it to a fully connected layer of 1024
neurons, which then passes the data to our SVM (see Section 3.2).
Each convolution layer has a Rectified Linear Unit (ReLU) activation
function, which is a common choice of activation function for
CNN architectures. The ReLU activation function produces
zero as an output when x < 0 and a linear value with
a slope of one when x > 0. Each pooling layer employs max pooling:
each 2x2 filter takes the maximum of the four values it covers to
use for the feature map. To combat overfitting we use a dropout rate of
0.4. Each CNN uses the Adam optimizer with a learning rate of 0.001.
Adam is an adaptive learning rate optimization algorithm that was
designed specifically for training deep neural networks [19].
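
The size arithmetic in the paragraph above can be checked with a few lines of plain Python. This is only a sketch of the layer bookkeeping, assuming unpadded ("valid") convolutions with stride 1 and 2x2 pooling with stride 2, which is what the stated sizes imply; the helper names are ours and it is not the authors' TensorFlow/Keras model.

```python
def conv_out(size, kernel):
    """Spatial size after an unpadded convolution with stride 1."""
    return size - kernel + 1


def pool_out(size, window=2, stride=2):
    """Spatial size after unpadded pooling."""
    return (size - window) // stride + 1


def relu(x):
    """Zero for negative inputs, identity (slope one) otherwise."""
    return x if x > 0 else 0.0


def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a list-of-rows feature map."""
    return [
        [max(fmap[y][x], fmap[y][x + 1], fmap[y + 1][x], fmap[y + 1][x + 1])
         for x in range(0, len(fmap[0]) - 1, 2)]
        for y in range(0, len(fmap) - 1, 2)
    ]


# Walk the 50x50 input through conv 5x5, pool, conv 4x4, pool.
size = 50
trace = [size]
for kind, k in [("conv", 5), ("pool", 2), ("conv", 4), ("pool", 2)]:
    size = conv_out(size, k) if kind == "conv" else pool_out(size, k, 2)
    trace.append(size)
print(trace)  # [50, 46, 23, 20, 10]
```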


3.2 Support Vector Machine (SVM) 


The SVM is implemented in Python using Python
library packages. SVMs were originally designed for binary classification;
in our research, however, we use them for a multiclass problem.
Generally, for classification problems in CNNs, the last layer uses
Softmax. In this research, we use the L2-SVM, which is differentiable
and minimizes the squared hinge loss, i.e., the sum of the squared
slack terms. The optimization problem for
the L2-SVM is shown below, where $w$ is an N-dimensional weight
vector, $b$ is the bias term, $\xi_i$ are slack variables, and $C$ is the
penalty parameter.
Minimize:

$$\frac{1}{2}\lVert w \rVert^{2} + \frac{C}{2}\sum_{i=1}^{N} \xi_i^{2} \tag{1}$$

Subject to:

$$y_i (x_i \cdot w + b) \geq 1 - \xi_i, \quad i = 1, \ldots, N \tag{2}$$


As reported in [48], training the classifier using the L2-SVM
objective function can outperform other methods such as L1-SVM
or Softmax regression.
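
To make the objective concrete, the sketch below evaluates an L2-SVM objective for a binary problem in plain Python. It assumes the common form of the objective, one half of the squared norm of w plus C/2 times the sum of squared slacks, with each slack set to max(0, 1 - y_i (x_i . w + b)), which is the smallest value allowed by the margin constraint. The function name and the toy data are ours; a multiclass setting such as the one in this paper would need one such classifier per class (an assumption, since the multiclass scheme is not spelled out here).

```python
def l2_svm_objective(w, b, xs, ys, C=1.0):
    """Value of 0.5*||w||^2 + (C/2) * sum_i slack_i^2 for a binary L2-SVM.

    Labels in `ys` are +1 or -1. Each slack is the squared-hinge term
    max(0, 1 - y_i * (x_i . w + b)).
    """
    regularizer = 0.5 * sum(wk * wk for wk in w)
    squared_hinge = 0.0
    for x, y in zip(xs, ys):
        score = sum(wk * xk for wk, xk in zip(w, x)) + b
        slack = max(0.0, 1.0 - y * score)
        squared_hinge += slack * slack
    return regularizer + 0.5 * C * squared_hinge


# Toy data (made up): the second point violates the margin, so its slack is 0.5.
value = l2_svm_objective(w=[1.0, 0.0], b=0.0,
                         xs=[[2.0, 0.0], [0.5, 0.0]], ys=[1, 1], C=2.0)
print(value)  # 0.75
```

Because the hinge term is squared, the loss is differentiable at the margin, which is what allows it to be trained by backpropagation as the last layer of a CNN.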


3.3 Prediction Classifier 


In addition to creating a CNN+SVM classifier for each script, we
also look at the similarities between pairs of scripts. The symbols
of every script are passed into the trained CNN+SVM models of the
other seven scripts. The basic idea of the predictive classifier is illustrated
in Figure 10.

Each similarity matrix produced by the CNN+SVM for the eight
scripts has different NxM dimensions based on the number of symbols
in each script. We define the following two measures to quantify
the strength of the relationship between two scripts:


(1) The Average of All takes the average of the strongest probability
matches for each symbol pair.

The rationale is that taking the average of the strongest matches
between two scripts takes into account all the symbols in
each script. If a symbol provided as input has a low correlation
with all of the trained symbols, the overall average
would reflect this.

(2) The Selective Average only considers pairs of symbols
which have a similarity match higher than seventy-five percent
and then takes their average.
The rationale is that the selective average provides two measures
of the similarity of two scripts. It provides
not only a higher overall average than the
Average of All but also the number of symbols which
are the closest together. The selective average also takes
into account that a script may not stem completely from
only one script, so not all symbols may have a high
correlation.
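
The two measures above can be sketched as follows. The sketch assumes one reading of the definitions: each row of the similarity matrix corresponds to one input symbol, each column to one trained symbol, and the "strongest match" of an input symbol is the maximum of its row. The function names and the toy matrix are ours, not taken from the paper.

```python
def best_matches(sim):
    """Strongest match score for each input symbol (one row per input symbol)."""
    return [max(row) for row in sim]


def average_of_all(sim):
    """Mean of the strongest matches over all input symbols."""
    best = best_matches(sim)
    return sum(best) / len(best)


def selective_average(sim, threshold=0.75):
    """Mean of the strongest matches above `threshold`, and how many there are."""
    strong = [s for s in best_matches(sim) if s > threshold]
    average = sum(strong) / len(strong) if strong else 0.0
    return average, len(strong)


# Toy 3x2 similarity matrix (made-up values): three input symbols, two trained ones.
toy = [[0.9, 0.1],
       [0.2, 0.8],
       [0.3, 0.4]]
overall = average_of_all(toy)            # mean of 0.9, 0.8, 0.4
selective, count = selective_average(toy)  # mean of 0.9, 0.8; count is 2
```

In this toy case the weakly matched third symbol lowers the Average of All but is excluded from the Selective Average, which is the behavior the two rationales describe.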


3.4 Classification Trees 


Each CNN+SVM for the prediction classifier has seven mappings, one
to each of the other scripts. The strength of the relationship between
scripts is given by the two averages presented in Section 3.3. In addition, we
take into account the number of symbols between the scripts which
have a correlation value > 75%.

To create a classification tree we employ two different algorithms,
as listed below.


(1) Similarity: The scripts which have a higher correlation are
paired. We use WPGMA (Weighted Pair Group Method with
Arithmetic Mean) to create our dendrogram for the scripts.
The WPGMA algorithm creates a dendrogram that displays
the structure in the similarity matrix. The nearest two clusters
are combined at each step, i.e., clusters x and y are combined
to create x ∪ y. Then the distance from x ∪ y to another cluster z
is the mean of the distance between z and x and the distance
between z and y, as shown
in Equation (3). Since we use a similarity matrix as input to
the WPGMA method, we use the complement of the matrix,
so that smaller values indicate higher similarity.
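
The WPGMA step described in item (1) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: it assumes that the complement of a similarity s is the distance 1 - s, and the function name, the dictionary-based input format, and the toy similarity values are ours.

```python
def wpgma(labels, similarity):
    """Cluster `labels` with WPGMA; return merges as (cluster_a, cluster_b, distance).

    `similarity[(a, b)]` is a symmetric score in [0, 1] for every pair of labels.
    It is converted to a distance as 1 - similarity (assumed complement).
    """
    dist = {frozenset(pair): 1.0 - s for pair, s in similarity.items()}
    clusters = list(labels)
    merges = []
    while len(clusters) > 1:
        # Find the two nearest clusters.
        x, y = min(
            ((p, q) for i, p in enumerate(clusters) for q in clusters[i + 1:]),
            key=lambda pair: dist[frozenset(pair)],
        )
        merges.append((x, y, dist[frozenset((x, y))]))
        merged = (x, y)
        clusters = [c for c in clusters if c != x and c != y]
        # WPGMA update: the distance to any other cluster z is the plain mean
        # of the distances from z to the two clusters being merged.
        for z in clusters:
            dist[frozenset((merged, z))] = (
                dist[frozenset((x, z))] + dist[frozenset((y, z))]
            ) / 2
        clusters.append(merged)
    return merges


# Toy similarities (made up for illustration, not results from the paper).
toy_similarity = {
    ("Phoenician", "Greek"): 0.9,
    ("Phoenician", "Indus Valley"): 0.3,
    ("Greek", "Indus Valley"): 0.5,
}
merges = wpgma(["Phoenician", "Greek", "Indus Valley"], toy_similarity)
```

Unlike UPGMA, the update does not weight the two merged clusters by their sizes, which is why the mean is taken over exactly two distances.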
[Figure 9 diagram: Convolution -> Pooling (stride 2) -> Convolution -> Pooling (stride 2) -> Fully Connected (1024) -> SVM -> Output]

Figure 9: The architecture of our CNN+SVM classifier. 


[Figure 10 diagram: Unknown Script -> Trained Phoenician CNN+SVM -> Nx22 Similarity Matrix]

Figure 10: The predictive CNN+SVM classifier comparing the other seven scripts to the Phoenician alphabet. The unknown 
script is replaced with any of the seven other scripts. The size of the matrix is dependent on the number of symbols in the 
unknown script provided. 


$$d_{(x \cup y),z} = \frac{d_{x,z} + d_{y,z}}{2} \tag{3}$$

(2) Hierarchical: Each script may be an ancestor or descendant of another
script. The hierarchical tree generation is implemented again
using WPGMA but also considering the time period when
each script was used. By doing this we can create a descendant
tree, which highlights the possible descendants of each
script. The details are shown below in Algorithm 1.


4 EXPERIMENTAL RESULTS AND ANALYSIS


In this paper, we have three main building blocks: the
dataset creation, the CNN+SVM classifier, and the hierarchical tree
creator. The latter two were first independently verified
and then combined to create a final product.


4.1 Validation of the Script Classifier


Each script has its own CNN+SVM classifier. The accuracy for the
different scripts is shown in Table 1 for an increasing number of epochs
(step size = 25). All scripts already exceed 90% accuracy at 25 epochs,
similarly to MNIST CNNs.


4.1.1 Script Prediction. For each script, its CNN+SVM classifier
has an almost perfect accuracy at 100 epochs. We therefore examine
whether the CNN+SVM can be used to find ancestors and/or descendants
of other scripts. We partition this experiment into two categories:
Known Origin and Unknown Origin. The known origin
scripts validate our framework and ensure that our tool is capable
of reproducing established results. Some specific categorizations:

(1) Known Origin: Phoenician is the ancestor of ancient Greek,
as mentioned already by Herodotus, and of Brahmi, via Aramaic.
The Cretan Hieroglyphic script is an ancestor of Linear B.
(2) Unknown Origin: The Sumerian Pictographs, the Indus
Valley, and the Proto-Elamite scripts have unknown ancestors
and descendants.


Algorithm 1 Time Based Descendant Tree

Create parent node P
Create a node for each script
for all Closest Script Pairs Sx and Sy do
    if Sx.Time > Sy.Time then
        Parent of Sx is P
        Parent of Sy is Sx
    else
        Parent of Sy is P
        Parent of Sx is Sy
for all Singleton Scripts Sz do
    Parent of Sz is P
return Tree
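
Algorithm 1 (the time-based descendant tree of Section 3.4) can be sketched in Python as follows. The sketch makes two assumptions that the pseudocode leaves open: the closest script pairs are supplied by the caller (for example, from the first-level WPGMA merges), and `Time` is a number for which a larger value means an older script, such as the first year of use in BCE. The function name and the dictionary-based tree are ours.

```python
def descendant_tree(scripts, closest_pairs, time):
    """Return a {child: parent} map following Algorithm 1.

    scripts:       all script names
    closest_pairs: list of (Sx, Sy) pairs judged closest to each other
    time:          time[s] is larger for older scripts (assumed: start year BCE)
    """
    root = "P"
    parent = {}
    paired = set()
    for sx, sy in closest_pairs:
        if time[sx] > time[sy]:
            # Sx is older: it hangs off the root and Sy descends from it.
            parent[sx] = root
            parent[sy] = sx
        else:
            parent[sy] = root
            parent[sx] = sy
        paired.update((sx, sy))
    # Scripts that were not paired attach directly to the root.
    for sz in scripts:
        if sz not in paired:
            parent[sz] = root
    return parent


# Start dates as given in Section 2.1 (years BCE).
start_bce = {"Phoenician": 1200, "Greek": 800,
             "Sumerian Pictographs": 3100, "Indus Valley": 2400,
             "Proto-Elamite": 3000}
tree = descendant_tree(
    scripts=list(start_bce),
    closest_pairs=[("Greek", "Phoenician"), ("Sumerian Pictographs", "Indus Valley")],
    time=start_bce,
)
```

Under these assumptions the sketch places Greek under Phoenician and the Indus Valley script under the Sumerian Pictographs, with the unpaired Proto-Elamite script attached to the root.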
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Table 1: Validation Accuracy 


                     | Number of Epochs
Script               | 25    | 50    | 75    | 100
Brahmi               | 95.09 | 98.15 | 98.24 | 99.35
Cretan Hieroglyphs   | 91.09 | 92.84 | 94.47 | 97.53
Greek                | 93.49 | 96.26 | 97.23 | 98.63
Indus Valley         | 93.50 | 95.70 | 96.85 | 98.23
Linear B             | 91.19 | 93.15 | 96.42 | 99.48
Phoenician           | 93.18 | 94.77 | 95.36 | 97.52
Proto-Elamite        | 91.93 | 94.55 | 97.05 | 99.09
Sumerian Pictographs | 90.79 | 93.21 | 96.94 | 97.40


4.1.2 Validation of Our Method - Known Script Prediction. Using
the prediction techniques, we aim to see the similarities between
the scripts. We first validate our approach by passing Greek into the
trained Phoenician CNN+SVM and vice versa. Similarly, we repeat
this experiment with Linear B and the Cretan Hieroglyphs.

As seen in Figures 11 and 12, the heatmaps of the similarity
matrices between Phoenician and Greek indicate high correlation
on the diagonal. This indicates that Phoenician and Greek have an
almost one-to-one mapping. This result is validated by
the known mapping between Greek and Phoenician shown in Table
2. We find similar results with Linear B and Cretan Hieroglyphs,
which also indicates that the Cretan Hieroglyphs and Linear B have
an almost one-to-one mapping.


4.1.3 Unknown Origin Script Prediction. Since the CNN+SVM predictor
worked well on the known origin scripts, yielding the expected
ancestor-descendant relationships, we also apply it to
the unknown origin scripts. As visualized in Figure 13, the
Sumerian Pictographs and the Indus Valley script have a fairly
strong correlation and an almost one-to-one mapping, similar to
the relation between Phoenician and Greek and between Cretan
Hieroglyphs and Linear B. Table 3 lists the number of symbols
which have a > 75% correlation between scripts.


4.2 Tree Visualization Analysis 


The similarity matrices shown in the previous sections produce the 
classification and hierarchy trees as shown in Figures 14 and 15, 
respectively. 


4.2.1 Classification Tree. Besides confirming the known origins
noted earlier, the classification tree generated some interesting new
results. In particular, Brahmi is closest to Phoenician and Greek.
The visualization also shows that Brahmi, the Cretan Hieroglyphs,
Greek, Linear B, and Phoenician form one branch of the classification
tree, while the Sumerian Pictographs are most closely related to the
Indus Valley script.


4.2.2 Hierarchy Tree. The hierarchical tree not only shows the
similarity between pairs of scripts but also visualizes that Greek
is a descendant of Phoenician and Linear B is a descendant of Cretan
Hieroglyphs. In addition, the Indus Valley script has been classified
as a possible descendant of the Sumerian Pictographs. Brahmi and
Proto-Elamite have an unknown ancestor. However, they share enough
similarities with the other scripts to suggest an unknown hypothetical
common origin of these eight scripts.


Table 2: Mapping between Greek and Phoenician.


Phoenician | Greek
aleph      | Α alpha
beth       | Β beta
giml       | Γ gamma
daleth     | Δ delta
he         | Ε epsilon
waw        | Ϝ or Υ, digamma or upsilon
zayin      | Ζ zeta
heth       | Η eta
teth       | Θ theta
yodh       | Ι iota
kaph       | Κ kappa
lamedh     | Λ lambda
mem        | Μ mu
nun        | Ν nu
samekh     | Ξ xi
ayin       | Ο omicron
pe         | Π pi
sade       | Ϻ san
qoph       | Ϙ koppa
res        | Ρ rho
sin        | Σ sigma
taw        | Τ tau
-          | Φ phi
-          | Χ chi
-          | Ψ psi
-          | Ω omega


5 RELATED WORK 
5.1 Background - Indus Valley Script 


Sir Alexander Cunningham, one of the first to encounter the Indus
Valley script, assumed that the seals were foreign imports. He later
stated that Brahmi might be a descendant of the Indus Valley script.
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Greek input into Phoenician CNN+SVM 


[Heatmap axis labels: the Greek letter names from beta to omega; color bar ticks at 1.0, 0.8, and 0.6]

Figure 11: The Greek letters are provided as input to the trained Phoenician CNN+SVM. 


Table 3: The number of symbols with correlation > 75% between each pair of the eight scripts. 


Brahmi | Cretan Hier. | Greek | Indus Valley | Linear B | Phoenician | Proto-Elamite | Sumerian Pict.
Brahmi 34 - - - - - - -
Cretan Hier. 2 22 - - - - - -
Greek 9 26 - - - - -
Indus Valley 8 5 9 23 - - - -
Linear B 3 20 7 20 - - -
Phoenician 9 6 22 9 9 22 - -
Proto-Elamite 2 2 2 3 17 -
Sumerian Pict. 6 6 7 20 5 7 3 39
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Phoenician input into Greek CNN+SVM 
Figure 12: The Phoenician letters are provided as input to the trained Greek CNN+SVM.


Many other scholars have connected the Indus Valley script to 
Brahmi [29-31]. Many scholars also suppose that the Indus Valley 
Script expresses some Dravidian language [24, 26, 27, 43, 45, 46, 49], 
where the work from [49] was one of the first publications using 
computer aid to analyze the Indus Valley script. 

Some scholars, such as McAlpin [22], support the Elamo-Dravidian 
hypothesis, which links the Dravidian to the Elamite languages. 
McAlpin also believes that the Indus Valley script could be part of 
the Elamo-Dravidian language family. That hypothesis is supported 
by evidence of extensive trade between Elam and the Indus Valley 
civilization. 

There are a few scholars who believe that the Indus Valley script
does not encode a language [12]. These scholars say that the Indus Valley
script is comparable to nonlinguistic signs which symbolize family
or clan names/symbols and religious figures/concepts. Regardless
of whether it encodes a language, its similarity to the other scripts still
suggests that the symbols were derived from Sumerian Pictographs.

Nevertheless, the brevity of Indus texts may indeed suggest that
the script represented only limited aspects of an Indus language. The
same is true of the earliest, proto-cuneiform, writing on clay tablets from
Mesopotamia, around 3300 BC, where the symbols record only
calculations with various products (such as barley) and the names
of officials.


5.2 Machine Learning 


Scholars have used various machine learning techniques to analyze 
and classify images and read text [17, 18]. 

Support vector machines and neural networks have been used 
to recognize a multitude of scripts. Artificial neural networks and 
SVMs were compared on the Devanagari script, a descendant of 
the Brahmi script [1]. Arabic handwriting recognition was recently
studied using the CNN+SVM combination [10]. In addition, handwritten
Chinese characters were analyzed using CNNs [14, 47].
Earlier work of the authors shows the similarity between the Indus 
Valley script and other scripts using CNNs [6, 7]. However, the use 
of neural networks to generate script families is a new domain. 


5.3 Classification Trees 


Revesz [34, 35] used hypothetical evolutionary tree reconstruction 
algorithms to analyze the development of the Cretan Script Family. 
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Sumerian input into Indus CNN+SVM 
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Figure 13: The Sumerian Pictograms are provided as input to the trained Indus Valley CNN+SVM. 
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Figure 14: The classification tree created from the similarity matrix using WPGMA. 
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Figure 15: The hierarchical tree created by taking time into account. 
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The matching of Minoan Cretan Hieroglyphic and Linear A 
symbols with the Carian and the Old Hungarian alphabets yielded 
new phonetic values for the Cretan Hieroglyphic and Linear A 
symbols. The new phonetic values allowed the decipherment of 
the Linear A script [39], and the Cretan Hieroglyphic script [36], 
including the Arkalochori Axe [40] and the Phaistos Disk [37] 
inscriptions. The AIDA system [42] is an online Minoan inscriptions 
database that also contains some of these translations. 

The origin of languages and scripts has long been studied by
linguists. The use of genetic information tying civilizations to
their languages has only recently been studied [32, 33, 38]. Using
human archaeogenetics may provide new insight into the diffusion
of human populations in association with various language families.


6 CONCLUSIONS AND FUTURE WORK 


The invention and spread of writing was a giant step for humanity 
that is still largely shrouded in mystery. However, our data mining 
of ancient script databases revealed several interesting hitherto 
unknown relationships among the eight scripts studied. This work 
is only the beginning of a systematic neural networks-based explo- 
ration of an ancient script family that likely encompasses not only 
the eight scripts that we studied but also many others. Hence, as
future work, we plan to add to our database other ancient scripts
from the region of the Near East and the Mediterranean Sea. By
adding more scripts to our CNN+SVM predictor system, we can 
obtain a more complete tree of visual similarities and reduce the 
remaining uncertainties in the development of one of the oldest 
script families in the world. 
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ABSTRACT 


Geo-distributed analytics is becoming increasingly commonplace
as IoT, fog computing and big data processing platforms are
nowadays integrating with each other. In this work, we deal with a
problem encountered when complex Spark workflows run on top 
of geographically dispersed nodes, either data centers or individual 
machines. There have been proposals that optimize the execution 
of such workflows in terms of the aggregate traffic generated or 
the latency (which is due to data transmission), or both metrics. 
However, the state-of-the-art solutions that target both objectives 
are either significantly sub-optimal or suffer from high optimiza- 
tion overhead. In this work, we address this limitation. The main 
solutions that we propose are both efficient and effective; based on 
either the extremal optimization or the greedy algorithm design 
paradigm, they can yield significant improvements with an optimization
overhead of a few tens of seconds, even for Spark workflows
of 15 stages running on 15 distributed nodes. We also show
the inadequacy of evolutionary optimization solutions, such as ge- 
netic algorithms, for our problem. 
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1 INTRODUCTION 


Geo-distributed analytics, such as fog computing solutions [1, 22], 
is an emerging area boosted by the maturity of big data analytics 
platforms supporting data streams, e.g., Flink [8] and Spark [2], 
along with the prevalence of IoT devices in modern applications 
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[5, 7, 18]. The execution model builds upon and extends the one in 
distributed [21] and parallel [9] databases. In short, the execution 
plan is typically a directed acyclic graph of operators and benefits 
from the main types of query plan parallelism, namely partitioned, 
pipelined and independent. 

In this work, we consider Spark running over separate physi- 
cal nodes with distinct data transmission capacities; as reported in 
[10], the applications of such a setting span several fields, such as 
climate science, multinational companies, bio-informatics and log 
analysis. A Spark execution plan inherently benefits from pipelined
and partitioned parallelism [3], with the underlying cluster management
layers, e.g., YARN and Mesos, being responsible
for the actual runtime task scheduling. A typical assumption is that
the cluster on which the execution runs is characterized by abundant
memory and fast node interconnection speeds, and the whole
processing takes place in a single geographical area. However, this
assumption becomes a limitation when the data to be processed
are physically stored in multiple places and/or processing needs to
occur close to the data source. To overcome this limitation, several
geo-distribution-aware extensions to MapReduce-based solutions 
have been proposed [10]. 

Optimization techniques for geo-distributed Spark execution plans
directly affect the manner in which partitioned parallelism is enforced,
by specifying the portion of the tasks in each Spark stage
that each processing node should become responsible for. Current
techniques to this end aim to minimize either the total traffic between
the nodes, e.g., [27], or the latency, e.g., [23]. In a recent
proposal of ours, we presented bi-objective solutions that target
both criteria [19]. The proposal in [19] challenges the validity of a
main motivation behind geo-distributed data flows, namely that it
is too costly to gather data in a single place, e.g., [10, 16, 28], and
is tailored to multi-stage workflows rather than simple two-stage
MapReduce ones. It comprises two techniques: a greedy one that
is fast but not very effective in terms of the quality of the derived
solutions, and another one based on iterated local search that is effective
but takes longer, on the order of a couple of minutes, to
compute the proposed task distribution.

In this work, we make a twofold contribution. Firstly, we combine
the best attributes of the solutions mentioned above. The main
bi-objective solutions that we propose are capable of running much
faster than the best performing one in [19] and still yield significant
improvements over the main competitor, as evidenced by the
results of a thorough evaluation. The solutions are based on either
the extremal optimization (EO) paradigm [4, 6] or the greedy algorithm
design strategy, applied in a less shortsighted manner than in [19]. Secondly,
also in light of the well-known No Free Lunch theorem [29],
it is important to find which optimization paradigm better fits
our specific problem; to this end, we show that evolutionary optimization
solutions, such as genetic algorithms, are inferior to the
solutions we propose here.

Figure 1: A real Spark DAG


Paper structure. The remainder of the paper is structured as follows.
In the next section, we give a motivating scenario. In Section
3, we give the formal problem definition and we outline the solu- 
tions in [19] to make this work self-contained. We present our new 
solution in Section 4. The evaluation aims to cover a wide range 
of scenarios and is presented in Section 5. We conclude with the 
discussion of the related work and the open issues in Sections 6 
and 7, respectively. 


2 A MOTIVATION EXAMPLE


Our work is highly inspired by performance issues in modern data 
analytics platforms, such as Spark, which is arguably the most- 
widespread framework for data-intensive cluster computing to date. 
The distinctive feature of our work is that we do not assume a cen- 
tralized, homogeneous setting; on the contrary we consider that a 
cluster may consist of physical machines that are geo-distributed, 
have heterogeneous uplink and downlink speed capacities, and 
communicate through sending data across a network in order to 
complete an application. Our algorithms can make such applications
run faster by offering a task placement plan that also reduces
the data transferred over the network; that is, we consider both
objectives, running time and data transfer. Next, we showcase how our
algorithms, namely Greedy-full and Extremal, to be presented in Section 4,
can improve a Spark application.

In a geo-distributed setting, it is reasonable to assume that data 
transmission is the dominant factor for the application latency. Fo- 
cusing on the data transmission capacities, we employ three ma- 
chines (noted as M1, M2 and M3 in Table 1) with uplink speeds 
of 5 MB/sec, 2 MB/sec, and 5 MB/sec, respectively. The downlink 
speeds are 5, 3 and 2 MB/sec, respectively. The execution plan of 
the application we try to optimize, in the form of a Directed Acyclic 
Graph (DAG), is a linear one, as shown in Figure 1. Each node 
(bounded rectangle in the figure) is a stage that consists of tasks, 
the placement of which is decided by our algorithms. The edges be- 
tween the nodes represent the data movement between the stages. 
The overall input is set to 287.6 MB and the selectivity between 
the stages is always equal to 1; i.e., the total amount of data being 
reshuffled and flowing across the stages remains the same. 
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We first compute the task allocation offline and then enforce 
the task allocation in Spark. Then, we compare the estimated run- 
ning time reduction with the actual one. The offline computation 
ignores the CPU overhead that a real setting has even when trans- 
mitting data [20] and thus the time it refers to is only the over- 
head of moving data. Note that the data movement reduction is 
the same in the offline computation and the real run. Table 1 shows 
the allocations decided by each algorithm, namely Iridium [23], our 
main competitor, Extremal and Greedy-full (i.e., the contributions 
of this work) for each stage and for each machine. Note that for 
the first two stages of Figure 1, we do not choose a new placement 
because we assume that the initial data placement, on which the 
task placement of the first two stages depends, is fixed; these two 
stages just read and parallelize the initial dataset evenly. Thus we 
start from the task placement of Stage 2. With these new task al- 
locations, Extremal is estimated to achieve a 50.5% reduction in 
the running time over Iridium, while Greedy-full achieves 9.86% 
reduction. The real reduction achieved in the Spark environment 
was 47.75% and 13% respectively, indicating that our algorithms 
can indeed reduce the running time of a real application in Spark. 
More importantly, the amount of data transmitted over the network
drops by 74.97% due to Extremal (from 799.7 to 200.1 MB)
and by 25.3% due to Greedy-full (from 799.7 to 597.3 MB).

Implementation Details. In order to enforce our task placement
in the Spark engine, we have rebuilt Apache Spark 2.3.2 with the
following changes: we override the TaskSchedulerImpl class, where
we disable the shuffling of the offers the executors make for a task,
and we edit the TaskSetManager class to set the task locality to
'Any' and thus prevent Spark from deciding a placement for the
tasks based on the data location. Finally, we can easily emulate a
geo-distributed setting with machines characterized by different
downlink and uplink speeds by using machines connected to a local
network and setting the bandwidth limits of the executors using a
tool such as the Wonder Shaper script.¹


3 BACKGROUND 


We first present the problem statement, which is kept the same as
in [19], and then we present the two existing solutions, the strong
points of which we combine in this work. The problem is stated in
a system-agnostic manner; i.e., it is not specific to Spark.


3.1 Problem Statement 


A geo-distributed data flow is represented as a DAG G(V, E). Each 
node vj € V, where j = 1... N and N = |V], represents a job and 
each edge represents a shuffle data movement between the jobs. 
For example, in Spark data flows, we consider a job to be a Spark 
stage (note that Spark uses the terminology job to refer to a set of 
stages); in between such stages, data shuffling takes place. Each job 
runs in parallel in M data centers (DCs): i.e., each DC becomes re- 
sponsible for a fraction of the job execution with the magnitude of 
the fraction devised by our algorithms. DCs generalize the notion 
of physical machines used in the motivation example. 
Conceptually, the workload of a job is split into small units of 
work, each allocated to a specific processing element, e.g., a multi- 
core server of a specific DC, as an atomic unit. We refer to these 


lavailable from https://github.com/magnific0/wondershaper 
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Table 1: Task placement decision (proportion of tasks allocated) of Iridium, Extremal and Greedy-full for each stage and 


machine 


; Ü i stage 2 stage 3 stage 4 stage 5 stage 6 
Algorithm/Stage-Machine —.— M2 M3 | Mi M2 M3 | Mi M2 M8 | Mi M2 M3 | Mi M2 M3 
Iridium 0.286 | 0.429 0.285 | 0.188 0.529 0.283 | 0.098 0.628 0.274 | 0.088 0.717 0.245 | 0.208 0.792 0.0 
Extremal 0.0041 0.993 0.0029 | 0.0 0.995 0.005 | 0.003 0.997 0.0 0.002 0.998 0.0 0.001 0.999 0.0 
Greedy-full 0.333 0.477 0.19 0.116 0.606 0.278 | 0.032 0.706 0.262 | 0.0 1.0 0.0 0.0 1.0 0.0 

splits as tasks. Due to shuffling, in the generic case, it is necessary to move data between DCs before the execution of each task. This data movement is the dominant factor regarding the running time of the jobs, while the actual execution time of the job is considered to be negligible.

In this work, we deal with the allocation of sets of tasks to each DC for each job. Let $I^j$ be the input dataset size of $v_j$. If the selectivity of the job is $a^j$, then the output dataset is of size $S^j = a^j \times I^j$; the job selectivity is defined as the ratio of its output to input size. If $v_j$ has outgoing edges in G, $S^j$ is divided into M parts to be sent to the jobs downstream, denoted by $r_i^j$, $i = 1 \dots M$, s.t. $\sum_{i=1}^{M} r_i^j = 1$. Essentially, $r_i^j$ corresponds to the fraction of tasks of the children nodes of $v_j$ assigned to the $i$-th DC (tasks are assumed to be infinitesimally divisible). In other words, the $r_i^j$ values affect the workload allocation of jobs $v_k$, where $(j, k) \in E$. Overall, each DC has to transfer a fraction of $(1 - r_i^j)$ of its local output data $S_i^j$ and to receive a total of $r_i^j \times (S^j - S_i^j)$ data from all the other DCs.² Following the rationale in [23], we specify the uplink (resp. downlink) bandwidth of the $i$-th DC as $U_i$ (resp. $D_i$). Table 2 summarizes the main notation.

Table 2: Notations used in the paper.

Symbol                 Meaning
G(V, E)                the data flow DAG
N, M                   number of jobs and DCs
$I^j$                  amount of input data of a job $v_j \in V$
$a^j$                  selectivity of a job $v_j$
$S^j$                  amount of intermediate output data of a job ($S^j = a^j \times I^j$)
$U_i$                  uplink bandwidth on DC i
$D_i$                  downlink bandwidth on DC i
$S_i^j$                amount of intermediate data of $v_j$ on DC i
$r_i^j$                fraction of tasks executed on DC i for jobs succeeding $v_j$
$TU_i^j$, $TD_i^j$     running time of intermediate data transfer on the up/down link of DC i
RT(G)                  total running time of G
DM(G)                  total data movement between DCs in G
$RT_j$                 running time of job $v_j$
$DM_j$                 total data movement between DCs of job $v_j$
allocations            an N × M array holding in each row allocations[j] the $r_i^j$, $j = 1 \dots N$, $i = 1 \dots M$ values

Based on the above, the time for a site to send data regarding the output of a job is $TU_i^j = (1 - r_i^j) \times S_i^j / U_i$, and the time to receive the data is $TD_i^j = r_i^j \times (S^j - S_i^j) / D_i$. The running time $RT_j$ of $v_j$ is $\max_i \{TU_i^j, TD_i^j\}$.

The total data movement from a node $v_j$ is equal to $DM_j = \sum_{i=1}^{M} (1 - r_i^j) \times S_i^j$. The total data movement is $DM(G) = \sum_j DM_j$, where $v_j$ has at least one outgoing edge.
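The per-job quantities defined above can be sketched in a few lines of Python. This is only an illustration of the formulas as stated in this section, not the authors' implementation; the function name `job_costs` and its argument names are hypothetical.

```python
def job_costs(r, s, up, down):
    """Return (TU, TD, RT_j, DM_j) for a single job v_j.

    r[i]    : fraction r_i^j of downstream tasks placed on DC i (sums to 1)
    s[i]    : intermediate data S_i^j held by DC i
    up[i]   : uplink bandwidth U_i
    down[i] : downlink bandwidth D_i
    """
    total = sum(s)  # S^j, the whole intermediate output of the job
    # Each DC ships out the share of its local data that is processed elsewhere.
    tu = [(1 - ri) * si / ui for ri, si, ui in zip(r, s, up)]
    # Each DC pulls in its share of the data held by all the other DCs.
    td = [ri * (total - si) / di for ri, si, di in zip(r, s, down)]
    rt = max(tu + td)  # the slowest link determines RT_j
    dm = sum((1 - ri) * si for ri, si in zip(r, s))  # DM_j
    return tu, td, rt, dm
```

For instance, with two DCs, `job_costs([0.5, 0.5], [10, 30], [1, 2], [5, 10])` yields upload times of 5 and 7.5, download times of 3 and 0.5, a running time of 7.5 and a data movement of 20.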

The running time of G, RT(G), is the maximum sum of $RT_j$ values across any path from a source job ($v_j$ without incoming edges) to a sink one ($v_j$ without outgoing edges); sink nodes have zero running time by default.
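The definition of RT(G) amounts to a longest-path computation over the DAG. A minimal sketch follows, assuming the DAG is given as an adjacency dictionary and the per-job running times are already known; the names `dag_running_time`, `edges` and `rt` are illustrative and do not come from the paper.

```python
from functools import lru_cache


def dag_running_time(edges, rt):
    """Longest source-to-sink sum of RT_j values.

    edges : dict mapping each job to the list of its downstream jobs
    rt    : dict mapping each job to its running time RT_j
    Sink jobs (no outgoing edges) contribute zero, as stated in the text.
    """
    @lru_cache(maxsize=None)
    def longest_from(job):
        children = edges.get(job, [])
        if not children:
            return 0.0  # sink node: zero running time by default
        return rt[job] + max(longest_from(child) for child in children)

    has_parent = {child for children in edges.values() for child in children}
    sources = [job for job in rt if job not in has_parent]
    return max(longest_from(job) for job in sources)
```

On a diamond-shaped DAG a → {b, c} → d with running times 2, 5, 1 and 9, the result is 2 + 5 = 7, since the sink d does not count.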

More formally, the problem we target is defined as follows: 

Problem Statement: Given a dataflow G, a fixed distribution of the initial data across M DCs, and a running time value RTbase, compute the $r_i^j$ values s.t. DM(G) is minimized and RT(G) is always less than (1 + ε)RTbase, where ε is a small constant, ε > −1. If 0 > ε > −1, then we enforce the solutions to seek improvements regarding both DM(G) and RT(G); when ε is positive, we tolerate increases in RT(G) compared to RTbase. We can also regard positive values of ε as the percentage of the performance degradation that is tolerated.
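The role of ε can be made concrete with a small acceptance check. This is a sketch of the constraint in the problem statement only; the helper names `rt_threshold` and `is_improvement` are hypothetical, and the candidate values used in the usage note are made up for illustration.

```python
def rt_threshold(rt_base, eps):
    """Upper bound (1 + eps) * RTbase on the running time; eps must exceed -1."""
    if eps <= -1:
        raise ValueError("eps must be greater than -1")
    return (1 + eps) * rt_base


def is_improvement(dm_new, rt_new, dm_best, rt_base, eps):
    """A candidate is kept only if it moves less data and respects the bound."""
    return dm_new < dm_best and rt_new < rt_threshold(rt_base, eps)
```

With RTbase = 38.19 and ε = 0.1 the bound is about 42, so a candidate with a running time of 37.18 and lower data movement is accepted; with ε = −0.1 the bound drops to about 34.4 and the same candidate is rejected.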


²Note that in general, $S_i^j \neq r_i^j \times S^j$, i.e., the distribution of the intermediate results in a job is not necessarily the same as the way these results are shuffled in the next jobs. However, assuming a uniform distribution of results, it holds that $S_i^j = \mathrm{mean}(r_i^k) \times S^j$, where $(k, j) \in E$.
Note that the higher we set ε, the more the problem tends to be a single-objective optimization (that of minimizing DM(G)) in practice.


3.2 Existing Solutions and Limitations 
In [19], a two-step approach was followed: 

(1) Use Iridium [23] as the guideline for the initial assignment of tasks, i.e., computation of the $r_i^j$ values, to the DCs. Iridium decides the allocation for each job separately, after performing a topological sorting on G, and considers the nodes from the upstream to the downstream ones. In this way, RTbase is derived.

(2) Re-arrange the allocations with a view to decreasing the total movement cost while not allowing the running time to degrade by more than a fraction ε of RTbase.


Then, for the second step, two techniques were proposed. The first one is a fast greedy one. In the next section, we introduce another greedy technique and explain the differences. The second one is an Iterated Local Search (ILS) algorithm that uses Stochastic Hill Climbing (SHC) internally. It randomly perturbs the initial solution, and then looks for additional randomly chosen small changes in the perturbed configuration, so that DM(G) improves,
Algorithm 1 Greedy-full algorithm

Require: allocations, RTthreshold, DM(G), RT(G), iterations
bestAllocations ← allocations
bestRT ← RT(G)
bestDM ← DM(G)
for i ← 1 to iterations do
    for each job do
        bottleneckDC ← findBottleneckDC(job)
        Reallocate tasks regarding the current job through distributing a proportion of f of bottleneckDC's fraction to the other DCs
        tempAllocations ← apply changes to all downstream jobs in G
        Calculate RT(G)' using tempAllocations
        Calculate DM(G)' using tempAllocations
        if DM(G)' < bestDM && RT(G)' < RTthreshold then
            bestAllocations ← tempAllocations
            bestRT ← RT(G)'
            bestDM ← DM(G)'
        end if
    end for
end for
return bestAllocations, bestRT, bestDM


while RT(G) remains under the threshold. The ILS-based solution is shown to be capable of yielding much better results at the expense of an overhead that is higher by an order of magnitude; e.g., in large flows it took 2-3 minutes on a modern PC to check 75 random perturbations, each running SHC 75 times. The extremal optimization-inspired technique and the new greedy one that we introduce in the next section manage to achieve similar quality in the results, with running times much closer to that of the initial greedy technique, as discussed in the experiments.


4 OUR PROPOSAL 


The aim is to devise fast algorithms that are as effective as the ILS-based one in [19].


4.1 A greedy solution that is less shortsighted 


The first algorithm we implemented is a greedy one, described in Algorithm 1 (termed Greedy-full). The algorithm starts from an initial solution derived by Iridium [23] (we also examine using a random solution in Section 5.2.3). The input of the algorithm is (i) the initial allocation of tasks on the DCs, (ii) the running time threshold RTthreshold = (1 + ε)RTbase, where RTbase is the initial RT(G), (iii) the initial DM(G), (iv) the initial RT(G) and (v) the number of iterations. The output is the new allocation of tasks, optimized for lower DM(G), with the new RT(G) being under the threshold.

The algorithm consists of two loops. The external one is repeated a configurable number of times (20 unless otherwise stated; see Section 5.2.2), while the internal one iterates over all the jobs in topological order. For each job, it finds the bottleneck DC. More specifically, the findBottleneckDC(job) function in the algorithm returns the DC that has the smallest non-zero task placement ratio. The algorithm removes a proportion f of this task placement ratio and distributes it proportionally to the rest of the DCs that already have tasks. In this work, we set f equal to 1/3. Then, it assesses the global impact of such a local change: it re-calculates the task placement of the downstream nodes, and if the new RT(G) is under the threshold and DM(G) is lower than the best value found so far, the solution becomes the best one.
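The local reallocation step of Greedy-full can be sketched as follows. This is an illustration only: it assumes that "proportionally" means in proportion to the current fractions of the receiving DCs, and the function names `find_bottleneck_dc` and `shift_from_dc` are hypothetical. The propagation to downstream jobs and the re-evaluation of RT(G) and DM(G) are not shown.

```python
def find_bottleneck_dc(fractions):
    """Index of the DC with the smallest non-zero task fraction, or None."""
    candidates = [i for i, share in enumerate(fractions) if share > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda i: fractions[i])


def shift_from_dc(fractions, src, f=1 / 3):
    """Move a proportion f of DC src's fraction to the other DCs with tasks.

    Assumption: the moved mass is split in proportion to the current
    fractions of the receiving DCs. The result still sums to 1.
    """
    others = [i for i, share in enumerate(fractions) if share > 0 and i != src]
    if not others:
        return list(fractions)  # nowhere to move the tasks to
    moved = f * fractions[src]
    weight = sum(fractions[i] for i in others)
    result = list(fractions)
    result[src] -= moved
    for i in others:
        result[i] += moved * fractions[i] / weight
    return result
```

For a hypothetical allocation (0, 0.75, 0.25), the bottleneck is the third DC, and the step yields approximately (0, 0.83, 0.17).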

In our previous work [19], we also implemented a greedy algorithm. The main difference with the algorithm of this work is that, when a job's task placement is altered, the effects are not propagated to the downstream nodes in the internal loop; i.e., the initial greedy solution focuses on local changes in a shortsighted manner. Addressing this limitation comes at the expense of higher optimization times to derive the final task allocation but, as shown later, the trade-off is interesting.


4.2 An EO-based solution 


We propose an EO-based solution that will be referred to as Extremal (see Algorithm 2). Like Greedy-full, Extremal starts from an initial solution. The input and the output remain the same for the two algorithms. Algorithm 2 consists of one loop. In each iteration, it finds the slowest job of the graph (through findSlowestJob(G)) and rearranges its task placement fractions by removing a fraction f of the task ratio of randomly picked DCs. We set f equal to 1/3 and the probability p of picking a DC to 1/2. This reallocation affects the downstream nodes, as $S^j$ is re-arranged across the DCs; this reallocation is computed using the Linear Programming (LP) technique from [23]. Then the new RT(G) and DM(G) are calculated and the solution becomes the best one so far only if DM(G) improves and RT(G) is under the given threshold. The number of iterations is configurable but, unless otherwise stated, we set it to 100 (see Section 5.2.2). Compared to the Greedy-full solution, its main difference is that it focuses on the slowest job overall rather than examining all jobs in turn; then, for the slowest job, it examines more extensive random changes.
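The random perturbation applied to the slowest job can be sketched as below. The text does not specify how the removed mass is split among the other DCs, so this sketch assumes it goes to the DCs that already have tasks, in proportion to their current fractions; the function name `perturb` and the `rng` parameter are hypothetical, and the LP-based update of the downstream jobs is omitted.

```python
import random


def perturb(fractions, f=1 / 3, p=0.5, rng=random):
    """Randomly perturb the task fractions of one job.

    Each DC holding tasks is picked with probability p; a proportion f of
    its fraction is then handed to the other DCs that hold tasks
    (assumption: in proportion to their current fractions).
    """
    result = list(fractions)
    for src, share in enumerate(fractions):
        if share <= 0 or rng.random() >= p:
            continue  # DC has no tasks or was not picked
        others = [i for i, x in enumerate(result) if x > 0 and i != src]
        if not others:
            continue
        moved = f * result[src]
        weight = sum(result[i] for i in others)
        result[src] -= moved
        for i in others:
            result[i] += moved * result[i] / weight
    return result
```

Passing a seeded generator, e.g. `perturb([0.2, 0.5, 0.3], rng=random.Random(0))`, makes a run repeatable; whatever DCs are picked, the fractions remain non-negative and still sum to 1.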


4.2.1 Example. Suppose a linear G with three nodes and three DCs on which each node is executed in parallel. The uplink and downlink bandwidths of the DCs are U = (10, 1, 10) and D = (10, 5, 5). The $S_i^1$ values are $S^1$ = (120, 100, 50) and a = 1 for both jobs. Figure 2a shows the result of Iridium. In the figure, each circle corresponds to a job-DC pair annotated by the corresponding $r_i^j$ value.

We execute the loop of Algorithm 2 a single time. We set ε = 0.1. Thus the threshold is set to RTthreshold = (1 + 0.1) × 38.19 ≈ 42 sec. First, the algorithm searches for the slowest job, which in this case is the first one. Then, it chooses a random DC that has a fraction of tasks larger than 0; let us say that this DC is the third one. Then the algorithm removes 1/3 of its workload and transfers it to the 2nd DC, which is the only other DC with non-zero allocation; this results in $r^1$ = (0, 0.83, 0.17), which, in turn, yields $S^2$ = (0, 224.1, 45.9) and $r^2$ = (0.04, 0.96, 0). The new RT(G) is 37.18 sec. The benefit in the data movement is 28.3 MBs (Figure 2b). This new RT(G) is under the threshold so the solution is accepted. The final reduction over Iridium is 2.6% in terms of RT(G) and 11% in DM(G), which is the main metric we try to minimize. In this example, Greedy-full can reach the same outcome as Extremal, but the latter has
Algorithm 2 Extremal algorithm

Require: allocations, RTthreshold, DM(G), RT(G), iterations
bestAllocations ← allocations
bestRT ← RT(G)
bestDM ← DM(G)
for i ← 1 to iterations do
    slowestJob ← findSlowestJob(G)
    for each DC do
        With probability p, reallocate tasks regarding the slowest job through distributing a proportion of f of DC's fraction to the other DCs
    end for
    tempAllocations ← apply changes to G
    Calculate RT(G)' using tempAllocations
    Calculate DM(G)' using tempAllocations
    if DM(G)' < bestDM && RT(G)' < RTthreshold then
        bestAllocations ← tempAllocations
        bestRT ← RT(G)'
        bestDM ← DM(G)'
    end if
end for
return bestAllocations, bestRT, bestDM


Figure 2: Example using the Extremal algorithm. (a) Iridium task allocation; (b) Extremal algorithm task allocation.


performed only one reallocation (for the first job) while Greedy- 
full has checked all the jobs. 
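The running time reported in the example can be re-derived from the cost model of Section 3.1 using only the values quoted above. The following check is a sketch under that model; the helper `job_rt_dm` is a hypothetical name, and the third job is a sink and therefore contributes zero.

```python
U = (10, 1, 10)  # uplink bandwidths of the three DCs
D = (10, 5, 5)   # downlink bandwidths of the three DCs


def job_rt_dm(r, s):
    """Running time and data movement of one job for fractions r and data s."""
    total = sum(s)
    tu = [(1 - ri) * si / ui for ri, si, ui in zip(r, s, U)]
    td = [ri * (total - si) / di for ri, si, di in zip(r, s, D)]
    dm = sum((1 - ri) * si for ri, si in zip(r, s))
    return max(tu + td), dm


# Allocation after the single Extremal step described in the example.
rt1, dm1 = job_rt_dm((0, 0.83, 0.17), (120, 100, 50))
rt2, dm2 = job_rt_dm((0.04, 0.96, 0), (0, 224.1, 45.9))

rt_g = rt1 + rt2  # linear DAG: the sink adds nothing
dm_g = dm1 + dm2
print(round(rt_g, 2))  # 37.18
```

The first job is bounded by the download into the second DC (28.22 sec) and the second job by the upload out of the second DC (about 8.96 sec), which together give the 37.18 sec stated in the text.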
Figure 3: DAGs considered in the experiments (taken from [11])


5 EXPERIMENTS 
5.1 Setting 


We have already shown in Section 2 that estimated improvements 
correspond to improvements in real runs as well. To cover a broad 
range of scenarios, we resort to simulations. We use the simula- 
tion setting presented in our previous work [19], which includes 
five types of DAGs from [11] (presented in Figure 3) in three sizes 
each. The DAGs cover a very broad range of real applications, including DAGs produced when running TPC-H on Spark. To allow for a direct comparison against the results in [19], we experiment with 3 values of M (5, 10 and 15) and 3 values of ε (0.1, 0.2 and 0.5). The experiments were performed for every combination of DAG, number of DCs and ε value. Unless otherwise stated, p = 0.5, iterations = 20 for Greedy-full and 100 for Extremal, and f = 1/3. For the remainder of the variables, we resort to a setting similar to the one in [23]. The initial dataset size $I^j$ of the source nodes is randomly generated in the range [100MB, 1GB]. The $U_i$ and $D_i$ of each DC fall into the range of [100MB/sec, 2GB/sec]. The selectivities a of the jobs are between 0.01 and 2, with 50% of the job selectivities ranging from 0.01 to 0.5, 25% of them ranging from 0.5 to 1 and the remaining 25% ranging from 1 to 2 (similar to the selectivities in Facebook production analytics according to [23]). For each combination of DAG type, M and ε, we created random instances according to the parameters above, and we report the average values.
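One way to draw random instances matching these ranges is sketched below. Several details are assumptions rather than facts from the paper: values are drawn uniformly within each stated range, 1GB is taken as 1000MB, and the names `sample_selectivity` and `sample_instance` are hypothetical. How the initial data is spread across DCs is not covered.

```python
import random


def sample_selectivity(rng):
    """50% in [0.01, 0.5], 25% in [0.5, 1], 25% in [1, 2] (uniform within each)."""
    u = rng.random()
    if u < 0.5:
        return rng.uniform(0.01, 0.5)
    if u < 0.75:
        return rng.uniform(0.5, 1.0)
    return rng.uniform(1.0, 2.0)


def sample_instance(num_jobs, num_dcs, num_sources, seed=None):
    """Draw one random experiment instance (sizes in MB, bandwidths in MB/sec)."""
    rng = random.Random(seed)
    return {
        "input_mb": [rng.uniform(100, 1000) for _ in range(num_sources)],
        "uplink": [rng.uniform(100, 2000) for _ in range(num_dcs)],
        "downlink": [rng.uniform(100, 2000) for _ in range(num_dcs)],
        "selectivity": [sample_selectivity(rng) for _ in range(num_jobs)],
    }
```

Fixing `seed` makes an instance reproducible, which is convenient when the same instance has to be fed to several algorithms for comparison.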


5.2 Main Experiments 


5.2.1 Main comparison. In the first set of experiments, we compare our algorithms, namely Extremal and Greedy-full, to the ones presented in our previous work [19], Iterated Local Search and Greedy, regarding the reduction in RT(G) and DM(G) they achieve over Iridium, when we set ε = 0.2. The results are presented in Figure 4 and Figure 5 for DM(G) and RT(G), respectively. On average, Extremal reduces Iridium's DM(G) by 28.16%, Greedy-full by 37.83%, ILS by 50.12% and Greedy by 3.25%. In most cases, Greedy-full reduces RT(G) as well, by 28.26% on average, Extremal by 11.03% and ILS by 44.31%, while Greedy increases RT(G) by 7.5%.

Figure 4: Percentage of DM(G) reduction for M = 5, 10 and 15 when ε = 0.2.

As we can observe from Figure 4, Extremal is outperformed by ILS and Greedy-full by a large margin in the DAGs where the slowest job turns out to be one close to the sink nodes, e.g., Small E, where a single node collects data from two previous nodes. In the other cases, the behavior of Extremal and ILS is similar, whereas there exist several combinations of DAG types and sizes where Extremal is the best performing approach on average. The performance of Greedy-full is closer to that of ILS, but, as explained later, with slightly higher overhead than Extremal.


Figure 5: Percentage of RT(G) reduction for M = 5, 10 and 15 when ε = 0.2.

5.2.2 Convergence Rate. In this section, we compare the convergence rate of Extremal and Greedy-full. In order to find the convergence rate of Greedy-full, we set the number of iterations to N × M = 6 × 10 = 60, while Extremal iterates 100 times. Figure 6 shows the results for the Small-A DAG and 10 machines. We can observe that Greedy-full converges at around the 10th iteration, which is faster than Extremal, which converges at around the 50th iteration. We should also consider the running time of the algorithms. While Extremal iterates more times, it only takes 5.7 sec, but Greedy-full takes 13.08 sec. Taking that into consideration, Extremal converges at around 2.85 sec and Greedy-full at around 2.18 sec (machine specifications are given when discussing time overheads in more detail). This explains our choice to set the number of iterations of the two algorithms to 20 and 100, respectively. We further investigate the behavior of Extremal, ranging the number of iterations from 25 to 150. The results are presented in Figures 7




IDEAS’19, June 10-12, 2019, Athens, Greece 




Figure 6: DM(G) (left) and RT(G) (right) convergence rate for the Small-A (top) DAG when running Extremal and Greedy-full (M = 10, ε = 10%)

Figure 7: Percentage of DM(G) reduction for the Small-A (top) and Large-E (bottom) DAGs when running Extremal for different M (horizontal axis), ε and number of iterations

and 8. Setting the iterations to 100 offers a good trade-off between 
the quality of the output and the running time of the algorithm. 
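The convergence times quoted in Section 5.2.2 follow from the total running times if every iteration is assumed to cost the same. A small sketch of that arithmetic, with the hypothetical helper `time_to_converge`:

```python
def time_to_converge(total_sec, total_iters, converged_at):
    """Elapsed time at the convergence iteration, assuming uniform iteration cost."""
    return total_sec / total_iters * converged_at


extremal = time_to_converge(5.7, 100, 50)       # 100 iterations, converges near the 50th
greedy_full = time_to_converge(13.08, 60, 10)   # 60 iterations, converges near the 10th
print(round(extremal, 2), round(greedy_full, 2))  # 2.85 2.18
```

Although Greedy-full needs fewer iterations, each of its iterations visits every job, so the two algorithms reach convergence after comparable wall-clock time.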


5.2.3 Impact of initial allocation. In this set of experiments, we tried initializing the Greedy-full and Extremal algorithms with a random solution rather than the Iridium one. The results show that the algorithms are quite sensitive to the initial allocation, as they cannot produce a plan that improves on Iridium's RT(G) and DM(G) (no figures are shown due to space constraints). In most cases, the final results of the algorithms that were initialized with the random solution are worse than the Iridium ones. Therefore, the initialization phase in our solution, which first optimizes for RT (through employing Iridium's approach) and then proceeds to DM minimization, is crucial in the bi-objective optimization solution.


5.2.4 Time overheads. The running time of each algorithm is presented in Table 3. The experiments were performed on a machine with an i7-4510U CPU at 2.00GHz and 8 GB of RAM. Two main observations can be drawn: (i) Extremal and Greedy-full incur lower overhead than ILS by an order of magnitude; and (ii) Greedy-full is slower than Extremal for non-small flows.
Figure 8: Percentage of RT(G) reduction for the Small-A (top) and Large-E (bottom) DAGs when running Extremal for different M (horizontal axis), ε and number of iterations


Algorithm 3 Genetic algorithm

Require: populationSize, recombinationProb, mutationProb, generations
population ← initializePopulation(populationSize)
best, bestRT, bestDM ← getBest(population)
for i ← 1 to generations do
    parents ← selectParents(population)
    children ← Ø
    for each pair in parents do
        child1, child2 ← recombination(pair, recombinationProb)
        children ← mutate(child1, mutationProb)
        children ← mutate(child2, mutationProb)
    end for
    population ← combine(population, children)
    population ← population(1 : populationSize)
    best, bestRT, bestDM ← getBest(evaluatedPopulation)
end for
return best, bestRT, bestDM


5.3 Comparison against an Evolutionary Solution


In this section, we present how an evolutionary algorithm performs in our setting. Specifically, we have implemented the genetic algorithm described in Algorithm 3 and compared it to Extremal and Greedy-full. First, the Genetic algorithm initializes the population and finds the best solution among its members. In our implementation, we initialized the population of size 400 with Iridium's solution, about 10% greedy solutions over Iridium and 90% random ones. Then, the population is divided into pairs, from the recombination of which new solutions (children) are produced. Then, the children are mutated with a small probability and inserted in the population, and the best solution is found. This is repeated for a number
of iterations called generations. The output of the algorithm is the best RT(G) and DM(G).

Table 3: Running times of the algorithms for different M values (in sec).

Algorithm           | Small A               | Medium C              | Large E
                    | M=5    M=10   M=15    | M=5    M=10   M=15    | M=5    M=10   M=15
Iridium             | 0.13   0.17   0.18    | 0.28   0.29   0.3     | 0.3    0.31   0.36
Extremal            | 5.44   5.7    5.71    | 4.37   4.73   5.95    | 13.91  18.17  18.77
Greedy-full         | 3.66   3.99   4.12    | 10.8   11.39  11.81   | 20.93  22.32  24.51
Greedy              | 0.16   0.19   0.21    | 0.8    1.12   1.31    | 2.45   3.29   3.6
ILS (75 iterations) | 59.99  69.7   71.98   | 87.73  91.16  93.44   | 130.98 131.26 159.04

Figure 9: Percentage of DM(G) reduction for the Small-A (top) and Large-E (bottom) DAGs when running Extremal, Greedy-full and Genetic for different M (horizontal axis)

Figure 10: Percentage of RT(G) reduction for the Small-A (top) and Large-E (bottom) DAGs when running Extremal, Greedy-full and Genetic for different M (horizontal axis)

Figures 9 and 10 show the results, when the number of generations is set to 400. As can be seen, Genetic is the least beneficial algorithm for our setting. That indicates that it is more effective to work on a single solution rather than having a collection of them, e.g., the population in Genetic. Moreover, the random combination of components from different solutions does not lead to a good outcome either.
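For readers unfamiliar with the population-based scheme being compared here, one generation of the loop in Algorithm 3 can be sketched generically as follows. This is not the authors' code: it assumes random pairing of parents, a scalar fitness where lower is better, and caller-supplied `recombine` and `mutate` callables (all hypothetical names); the recombination probability of Algorithm 3 is left to the `recombine` callable.

```python
import random


def next_generation(population, fitness, recombine, mutate, mutation_prob, rng):
    """One generation: pair parents, recombine, mutate, merge, keep the best.

    population    : list of candidate solutions
    fitness       : callable, lower is better (assumption)
    recombine     : callable (a, b, rng) -> iterable of children
    mutate        : callable (child, rng) -> mutated child
    mutation_prob : probability of mutating each child
    """
    size = len(population)
    parents = population[:]
    rng.shuffle(parents)  # assumption: parents are paired at random
    children = []
    for a, b in zip(parents[0::2], parents[1::2]):
        for child in recombine(a, b, rng):
            if rng.random() < mutation_prob:
                child = mutate(child, rng)
            children.append(child)
    # Merge parents and children, then truncate to the original size.
    merged = sorted(population + children, key=fitness)
    return merged[:size]
```

Because the previous population takes part in the final truncation, the best solution found so far is never lost between generations.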


6 RELATED WORK 


As the number of jobs that need to be executed in geo-distributed data centers is increasing, there have been several proposals for optimized task placement. Many works focus on minimizing the total traffic. For example, WANalytics [27] deals with the task placement in this regard, but does not consider the overall running time. The work in [17] offers a prediction of job execution time but also focuses only on the minimization of the data movement. Clarinet [26] is a query optimizer that chooses the best execution plan among the ones provided by multiple query optimizers, considering the WAN consumption during scheduling and task placement.

On the other hand, there are solutions that employ the minimization of the running time as their objective. Two earlier proposals, Nebula [24] and Tetris [12], overlook issues regarding total data movement. Heintz et al. [13] developed a framework that optimizes the data and task placement of each phase of a MapReduce job, focusing on minimizing the makespan of the query but not on the overall data movement either. Iridium [23], which is the work against which we compare our solution, also focused only on the running time. However, Iridium can modify the placement of the initial data as well. In our work, we assume that the initial data allocation is fixed. Tetrium [15] also tries to improve upon Iridium, as we do, in two ways: firstly, through considering the time spent due to computations and not only data transmission; secondly, through making scheduling decisions at a lower level than simply deciding the fraction of the tasks to run on each site, to account for the case when the available slots are fewer than the allocated tasks. Both these extensions are interesting and we plan to investigate them in the future. Contrary to our proposal, Tetrium focuses mostly on response time but supports constraints on data movement (we treat the two metrics as of equal importance through first optimizing for response time and then for data movement); also, in our solutions, we manage to handle stage dependencies better through not running a stage-by-stage technique only once.

There are also works that consider both metrics. For example, 
Flutter [14] is a system that performs bi-objective task placement 
online but all tasks of the same stage are allocated to a single data 




center. Works on multi-objective query optimization, such as [25], 
suffer from the same limitation. Finally, the work in [30] targets 
both metrics but is tailored to a single MapReduce flow with the 
reducer being executed on a single DC. In summary, none of these 
works can be applied to a generic DAG, where each DAG vertex is 
distributed across several nodes. 


7 DISCUSSION 


In this work, we proposed a fast solution that decides the task 
placement in complex analytics workflows targeting the minimiza- 
tion of both response time and data movement. The thorough ex- 
periments show that we can yield significant improvements over 
our main competitor with much less overhead than the previous 
proposal to the same end. 

In general, there are further open issues in the multi-objective 
problem we deal with. Taking into consideration the processing 
costs and capacity constraints of the participating nodes, in line 
with the work in [15], is a promising direction for future work. 
Also, investigating how the required metadata can be efficiently 
monitored online is an open issue. Finally, further research is re- 
quired for taking into account aspects such as scheduling decisions 
when multiple workflows run on the same infrastructure concur- 
rently. 
ABSTRACT


The increasingly massive spread of Open Government Data (OGD) is hailed as a driving force for economic and social growth, as well as an essential factor in promoting public awareness of the work of institutional decision-makers. However, this high data availability can disorient users when deciding which sources are best suited to their needs. Awareness of this indecision worries the heads of OGD portals, who have to face the increasingly concrete risk that a large part of their information assets remains unused. To assess the merits of these concerns, this paper aims to provide a snapshot of the use of OGD portals based on usage indicators that are directly or programmatically obtainable. Considering an adequately representative sample of OGD portals, our analysis highlights two aspects: first, a confirmation that most of the published datasets are very lightly used; second, the perception that information about the use of portals is rarely made available to the users.
1 INTRODUCTION


e-Government data covers authoritative and valuable information on our society. Open Government Data¹ (OGD) usually refers to public records (e.g. on transport, infrastructure, education, health, environment) that can be used and redistributed by anyone either for free or at a marginal cost [4]. Access to and free use of government data are seen as a goldmine of unprecedented social and economic potential [6][2]. However, although the exponential growth of OGD provides consumers with a massive amount of data, it also forces them to question the value of these unfamiliar sources in meeting their information needs, thus hampering their use. This leaves the data providers with the uneasy feeling that a large part of their data remains untapped [14][15]. This concern transpires in the statements of some US Chief Data Officers (CDOs) reported in "Are Open Data Efforts Working?", published in Government Technology Magazine in March 2018 [17], with several civic leaders reaching the conclusion: "We counted the clicks and we saw that these portals just weren't being used". Although OGD are considered a driving force for transparency [7][11], they have limited value if they are not utilized [8].

To cope with such a situation, government agencies have to make sure, by monitoring users' behaviour, that people are able to directly or indirectly (e.g. through third-party applications) access their datasets, so that they can be used to answer citizens' questions [20]. The adoption of metadata by OGD portals can help to facilitate user access through search and filtering capabilities [19][23]. In order to better inform potential users about the degree of adequacy of the data retrieved with respect to their needs, the metadata should also contain information on the quality and on the provenance of the datasets [1][13], in accordance with the W3C Web Best Practices Recommendation². Moreover, metadata may report OGD usage indicators such as the numbers of downloads and views, which drive users to the most popular datasets. These measures provide better insights on users' behaviour and may help policymakers to evaluate the impact of OGD resources [16].

This paper aims to provide an overview of the attractiveness 
of OGD portals on potential users. Initially we considered the in- 
stitutional portals of 98 countries from which we investigated the 
presence of usage metadata directly visible to the users. Subse- 
quently, our analysis focused on a set of six portals that also allow 
programmatic access via API to two usage indicators, i.e. the num- 
ber of views and downloads, for each portal dataset. 

The analysis suggests that most of the datasets directly reachable from the portals are little used (a few dozen times at most) or rarely used. Because of this evaluation, we also provide some insights into the practices of portal managers in publishing usage metrics, noting that even this type of metadata is rarely or only partially provided, making it less immediate for users to evaluate the reception of a dataset of their interest. This work is intended to be a first step towards understanding whether and how the use of datasets is related to their inherent quality or to the quality of the metadata associated with them.

The paper is structured as follows. Section 2 introduces the OGD 
portals object of our analysis and the methodological approach. 
Section 3 presents the results on the availability of usage metrics 
for a set of OGD portals and some insights on their usage. Section 
4 discusses the implications of the study. Section 5 presents the 
conclusions and future works. 


2 MATERIAL AND METHODS 


The evaluation of the use of OGD is performed considering a large set of OGD portals from all over the world. For each portal, we verified the availability of usage metrics (e.g. the number of views and downloads for the datasets, the applications that re-use the datasets), supplied as metadata visible from the platforms or recoverable in a programmatic way.


2.1 The selected OGD portals


We have selected 98 OGD portals among those of world countries examined and ranked by two initiatives: the Global Open Data Index (GODI)³, for the assessment of the publication of public data opened from a civic perspective, and the Open, Useful and Reusable data (OURdata) Index on Open Government Data, performed by the Organization for Economic Co-operation and Development (OECD)⁴, for the assessment of the governments' efforts to implement open data in three critical areas: openness, usefulness and re-usability of OGD [10]. We have considered the 94 countries ranked by GODI in 2017 according to different data categories. In addition, we have also considered four countries (i.e. Korea, Spain, Ireland, and Estonia) not included in the GODI ranking list, but ranked by OECD in 2018. For each country, we have analyzed its OGD portal and the visibility of usage metadata to users. Data collection was conducted on 28 and 29 March 2019.

Figure 1 summarizes the results, showing the distribution of usage metadata across all the 98 OGD portals. It highlights that portals generally fail to provide the user with such metadata. Few portals provide usage information: only 10% and 6% of the countries provide Views and Downloads metadata, respectively, and 2% provide other kinds of information (i.e. followers, reusing applications). 65% of the countries do not provide any usage indicators, and 17% of the portals are still at an early phase of development: they do not publish any datasets, or at least we could not find them.

Based on this analysis, Table 1 reports the list of portals we have considered in our study. They are the portals for which usage information is immediately visible at metadata level. The only exception is the English portal: even if not immediately visible, these data have been obtained by downloading a specific CSV file. We have also included a single non-national portal managed by the United Nations Office for the Coordination of Humanitarian Affairs (UN-OCHA⁵), the Humanitarian Data Exchange portal (HDX), aimed at sharing data across crises, as it provides data of different countries such as those included in the 17% set (e.g. Venezuela, Barbados, Saint Kitts) whose OGD portals have not been found.

Figure 1: Usage metadata distribution of OGD portals of 98 worldwide countries


2.2 Usage Metrics 


The EU Commission, in its yearly “Open Data Maturity in
Europe” report for 2018 [3], mentions a ‘Portal usage’ indicator,
which takes into account portal usage metrics such as the number
of unique visitors, the percentage of foreign visitors, typical user
profiles, traffic generated via the portal's API, popular data domains
and the most consulted datasets. While several figures, and related
graphs, are supplied for the first five metrics, only a few
lines are devoted to the most consulted datasets: they
“stem from domains that are of broad public interest, such as public
spending and procurement, mobility, social economic numbers, in
particular housing and environment data”.

To get insights into the data demand by users, we analysed two
indicators: the number of online views and the number of downloads
associated with each portal dataset [18]. By Views we mean
the number of times the page of a dataset was loaded in users’
browsers, and by Downloads the number of times a user has clicked
(on a URL or on a ‘Download’ button) to retrieve a resource for a
particular dataset. These values can be found in logs or returned
by portal APIs, and can also be found, alongside other dataset
metadata, on the dataset access page.

In some way, this information accounts for the activities of direct
users, i.e. those who access the datasets directly [15]. Perhaps a
more mature measure to assess the impact of datasets on end users
could take into consideration the indirect users, i.e. those who use
data processed by intermediaries. These values cannot
be inferred from the current portals. At most, references to
applications based on the datasets contained therein are reported
in specific sections of the portal, with an indication of the datasets
involved, but not (at least to the best of our knowledge, with the sole

Table 1: OGD portals and usage dimension values derived from direct portal access through Metadata (M), or downloadable
file (D). Information updated as of 26 March 2019


Country     | Portal           | #datasets | Views | Downloads | Other
US          | data.gov         | 236,352   | M     | -         | -
UK          | data.gov.uk      | 47,738    | D     | D         | -
Ireland     | data.gov.ie      | 9,001     | M     | -         | -
France      | data.gouv.fr     | 35,663    | -     | -         | M
Portugal    | dados.gov.pt     | 2,060     | -     | -         | M
UN-OCHA     | data.humdata.org | 8,571     | -     | M         | -
Taiwan      | data.gov.tw      | 40,055    | M     | M         | -
Colombia    | datos.gov.co     | 10,231    | M     | -         | -
Latvia      | data.gov.lv      | 267       | M     | -         | -
Poland      | dane.gov.pl      | 1,077     | M     | M         | -
Slovenia    | podatki.gov.si   | 3,389     | M     | -         | -
India       | data.gov.in      | 265,929   | M     | M         | -
Russia      | data.gov.ru      | 21,878    | M     | M         | -
Puerto Rico | data.pr.gov      | 179       | M     | -         | -
Korea       | data.go.kr       | 28,871    | -     | M         | -
exception of the French and Portuguese portals) with the number
of applications that exploit each dataset, included in the associated
metadata. For this reason we have limited ourselves to recovering
direct access measures.


2.3 Metric values retrieval


The values of the usage metrics cannot be retrieved by hand
for every single published dataset of each portal. This
information can be recovered in two ways: a) by downloading, in
very specific cases, some files containing statistics about the use of
each portal dataset; b) programmatically, by means of specific APIs
supplied by the Open Data software platform on which the portal
is built.

As regards the second option, CKAN (https://ckan.org/) is the most
widely used open source data management system [9] that provides the
tools for publishing, finding and using open data. It also includes a rich
RESTful JSON API for querying and retrieving dataset information.
It is currently used by many governments, organizations and companies
to make their huge data sources open and available. Generally,
these organizations deploy their own instances of CKAN, customizing
its default user interface and providing their own data storage
for the published datasets.

The information related to the number of views for a dataset
can be obtained through the CKAN API, by extracting the content of a
specific field called tracking_summary, which in turn contains
a pair of values, total and recent (i.e. Views in the last 14
days). When allowed by the portal, it is possible to make a REST call
that retrieves this information, returned as JSON or XML objects,
through different HTTP clients. To this end, we used the httr library
of the statistical software R. In any case, we first queried the CKAN
server to retrieve the list of the managed datasets, and only when
this succeeded did we send a GET call to retrieve the metadata
associated with each dataset. However, the presence of tracking
information in the GET response is not guaranteed by default but has
to be enabled server side (from version 2.7.3, the package_show API
call no longer returns the tracking_summary keys in the dataset or
resources by default). Below is the call that retrieves the tracking
information related to the dataset with id="xxxxx" from data.gov.


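library(httr)

# Ask CKAN for the dataset metadata, explicitly requesting tracking data
ds <- GET("http://catalog.data.gov/",
          path = "/api/3/action/package_show",
          query = list(id = "xxxxx", include_tracking = "true"))

cds <- content(ds)  # parse the JSON body into an R list

total <- cds$result$tracking_summary$total    # overall number of views
recent <- cds$result$tracking_summary$recent  # views in the last 14 days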


By cycling over the whole list of datasets of the portal, the overall
views situation may be recovered. Note that the CKAN API only returns
dataset Views and not Downloads information. A portal such as
the Humanitarian Data Exchange (HDX), based on an extension of
CKAN, also supplies a specific R library (i.e. rhdx) for the recovery
not only of the number of views but also of the number of downloads.
The following R code excerpt retrieves usage data for the third
dataset of the portal:


library(rhdx)

ds <- search_datasets()
downloads <- ds[[3]]$data$total_res_downloads
views <- ds[[3]]$data$pageviews_last_14_days
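
The cycle over the catalogue mentioned for CKAN portals can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: `extract_views` is a hypothetical helper, and the `responses` list holds invented values standing in for the parsed results of successive package_show calls (i.e. what `content(ds)` returns), so that the sketch runs without network access. It also covers the case in which tracking is not enabled server side and the tracking_summary field is absent.

```r
# Hypothetical helper: pull the view counters out of one parsed
# package_show response (a nested list, as returned by httr::content()).
extract_views <- function(resp) {
  ts <- resp$result$tracking_summary
  if (is.null(ts)) {
    # tracking not enabled server side: no counters in the response
    return(c(total = NA_integer_, recent = NA_integer_))
  }
  c(total = as.integer(ts$total), recent = as.integer(ts$recent))
}

# Invented responses, one per dataset id, standing in for real API results
responses <- list(
  ds_a = list(result = list(tracking_summary = list(total = 120, recent = 7))),
  ds_b = list(result = list(tracking_summary = list(total = 3, recent = 0))),
  ds_c = list(result = list(name = "dataset-without-tracking"))
)

# Cycle over the whole list and collect one row of counters per dataset
views <- t(vapply(responses, extract_views, c(total = 0L, recent = 0L)))
print(views)
```

Returning NA rather than 0 for datasets without tracking data keeps "never viewed" distinguishable from "not measured" in any later statistics.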


Finally, some portals, such as the French and Portuguese ones, use
other OGD platforms, which provide APIs that differ from CKAN's
both in the type of call and in the type of information returned. In
particular, the number of views, returned in the JSON response (in
a sub-field named ‘views’), is not indicated in the metadata visible
to users; instead, two other indicators are reported, namely ‘Reuse
number’ and ‘Number of followers’.
3 RESULTS


The analysis carried out on the portals listed in Table 1 focused
on two aspects related to the use of OGD: i) the availability of
usage information; ii) the portal usage trends related to user
behaviour.


3.1 Gathering usage data 


For each portal listed in Table 1, we have looked for usage information
obtainable via APIs. This has only been possible for the first
six portals listed in the table. For the remaining nine, either API
access was not available (e.g. Poland, Slovenia,
India) or, when present, the APIs could only be activated following
a token request (e.g. Russia, Colombia). Thus, we have decided, for
the moment, not to consider them.

As the first six portals accept REST calls retrieving catalogue
information (i.e. they answer with a 200 code and the list of all the
datasets in their catalog), we proceeded to retrieve the usage
information at dataset level. However, in two cases, i.e. the Portuguese
and UK portals, the returned usage metric values are 0 for all the
datasets (and the associated resources). For this reason, we did not
carry out further analysis for the Portuguese portal.
For the UK, instead, we have used the values contained in a CSV file¹¹,
which provided a usage information snapshot. Unfortunately, these
data are no longer available, but we have used them to understand
the situation of the UK portal.

After this further screening, the five portals from which
we have been able to programmatically extract useful information
are data.gov, data.gov.uk, data.gov.ie, data.gouv.fr and
data.humdata.org. Although this may be considered a limited sample, we think
it is nevertheless representative both in terms of the relevance of the
national portals and in terms of the number of datasets in their
catalogs. In fact, the statistics on the number of datasets present in
the 98 portals initially considered are the following: median = 783,
3rd quartile = 10,173, and 70th, 90th and 95th percentiles of, respectively,
8,300, 40,497 and 79,149 datasets. Indeed, data.gov supplies one of
the largest OGD catalogs, data.gov.uk and data.gouv.fr are two large
ones, and data.gov.ie and HDX fall within the largest 30%. One last note:
while the information obtained by API from the US, French, Irish
and HDX portals provides a snapshot of the current situation, that
derived from the UK CSV file supplies a frozen image of
dataset usage.


3.2 Portals usage trends 


Views and Downloads values gathered from the five selected portals
provide information about the usage frequencies of their datasets.
Specifically, Figures 2, 3 and 4 show the Views frequency for the
U.S., Irish and French portals, while Figures 5 and 6 illustrate the Views
and Downloads frequencies for the UK and HDX portals.

In all cases, the curves show heavy-tailed distributions, with a
few datasets used very frequently and most of them used
very rarely. These results are confirmed by
the descriptive statistics reported in Table 2, which supply more
insights into the usage trend. As concerns the relationship
between Views and Downloads, the values available for data.gov.uk
confirm what is expected: the number of downloads is less than

¹¹ https://data.gov.uk/data/site-usage/dataset - retrieved January 16, 2019
Figure 2: Views frequency for data.gov datasets
Figure 4: Views frequency for data.gouv.fr datasets


the Views; the inverse relationship in the case of HDX should
not deceive, since the reported Views value concerns the
last 14 days while that of the Downloads is the overall total.
The analysis of percentile values is more interesting. They show
a very similar trend: in every case, 25% of the datasets are
hardly used at all, i.e. 0 downloads and at most 6 views for about
12,000 datasets for the UK portal; at most 1 view for about 60,000
datasets for the U.S. portal; 0 views for both the Irish and French
portals. Things get only slightly better if we look at the median: in the UK
portal, another 12,000 datasets are viewed by at most 26 users, of
whom at most 4 have downloaded the viewed datasets. The data
for the U.S. portal show lower values, with about half of the
datasets (118,000) viewed by no more than 12 users. For the IE portal,
half of the datasets (4,500) have been viewed no more than three
times, while the French portal shows that half of its datasets have not
been viewed at all. When looking at the 3rd quartile, these values slightly
improve. The UK portal recorded at least 104 views and 19 downloads
for about 12,000 datasets, while 25% of data.gov datasets have
been viewed by at least 19 users, similar to the numbers recorded
Table 2: Statistics of usage metrics for the US, UK, IE, F and HDX portals


Portal           | Metric    | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max.
data.gov         | Views     | 0    | 1       | 12     | 34   | 19      | 127,643
data.gov.uk      | Views     | 1    | 6       | 26     | 291  | 104     | 204,803
                 | Downloads | 0    | 0       | 4      | 79   | 19      | 139,479
data.gov.ie      | Views     | 0    | 0       | 3      | 56   | 20      | 17,248
data.gouv.fr     | Views     | 0    | 0       | 0      | 76   | 1       | 160,003
data.humdata.org | Views     | 0    | 0       | 0      | 3.3  | 2       | 444
                 | Downloads | 0    | 1       | 17     | 168  | 96      | 13,309
Figure 5: Views and downloads frequency for data.gov.uk
datasets


for the IE portal. As for HDX, considering that the 14-day views
sample provides rather modest results, the number of downloads
seems more significant, with about half
of the datasets downloaded at least 17 times and a quarter about one
hundred times or more.

To give an approximate idea of the frequency ranges of the
most used datasets, Table 3 reports the 90th, 95th and 99th percentiles
for each portal and each available metric. In the case of the U.S. portal, the
under-usage seems partly confirmed also for the 90th and 95th
percentiles, i.e. 95% of the datasets have been viewed no more than 59 times.
Just 1% of the datasets (about 2,300) have been viewed by at least 298
users. The UK portal performs better: 10% of the datasets have been
viewed at least 358 times and downloaded 75 times, while
about 400 (i.e. the 99th percentile) have about 4,000 views and 1,000
Figure 6: Views and downloads frequency for
data.humdata.org datasets


downloads. The IE portal lies in the middle, recording about 900
datasets viewed almost 100 times and about 90 viewed by about one
thousand users. The French usage figures are still the lowest among the
national portals, even for the highest percentiles. While the
values of views (but over 14 days) for HDX are of limited significance
even for the higher percentiles, those of absolute downloads are more
than encouraging, with 10% of the datasets downloaded at least
(approximately) 400 times, culminating in more than 2,600 downloads for
80 ‘top’ datasets.
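
The descriptive statistics of Table 2 and the upper percentiles of Table 3 can be obtained, for any vector of per-dataset counts, with base R. The following is a minimal sketch under stated assumptions: the `views` vector is invented toy data, not values from any portal, and R's default quantile definition (type 7) is assumed, since the text does not state which definition was used.

```r
# Toy per-dataset view counts (invented values with a heavy right tail)
views <- c(0, 0, 0, 1, 2, 3, 5, 8, 40, 1000)

# Min., quartiles, median, mean and max., as in the layout of Table 2
basic <- summary(views)

# Upper percentiles, as in the layout of Table 3
upper <- quantile(views, probs = c(0.90, 0.95, 0.99))

print(basic)
print(upper)

# A mean far above the median is a quick symptom of a heavy tail
skewed <- basic[["Mean"]] > basic[["Median"]]
```

With these toy values the median is 2.5 while the mean is 105.9: a single very popular dataset is enough to pull the mean far away from the typical dataset, which is why the percentiles are more informative than the mean here.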


4 DISCUSSION 


Our study highlights two issues related to usage of OGD portals: i) 
they are largely underused; ii) they generally do not provide users 
with usage information. 
Table 3: Percentiles of usage metrics for the U.S., UK, IE, F and HDX portals


Portal           | Metric    | 90% | 95% | 99%
data.gov         | Views     | 33  | 59  | 298
data.gov.uk      | Views     | 358 | 807 | 3,959
                 | Downloads | 73  | 178 | 1,071
data.gov.ie      | Views     | 98  | 233 | 998
data.gouv.fr     | Views     | 4   | 10  | 302
data.humdata.org | Views     | 7   | 13  | 46
                 | Downloads | 376 | 655 | 2,620


4.1 Underutilisation of OGD datasets 


The results of our analysis, albeit partial because of the small number
of OGD portals considered, highlight a situation that seems
common to portals of different sizes and missions: the
majority of the published datasets are used only marginally. This seems
to confirm the ‘fears’ expressed in the CDOs survey presented in
[17], mentioned in the Introduction.

Obviously, the direct use observable through the adopted metrics
does not exhaust the potential of the data offered by the portals: as
mentioned, a more meaningful parameter is probably tied to the
number of indirect users, namely those who use third-party applications,
in combination with the number of such applications.
While the number of users of an application can be difficult to collect
and assign to a dataset, the number of applications using a dataset
could be collected, as done by the aforementioned French and
Portuguese portals. This can improve the perception of the utility
of a dataset and provide an indicator for a quantitative assessment
of the indirect use of the portal. To facilitate this recognition,
third-party applications that use a dataset should always be encouraged
to list it among their sources [21]. This would help users not only to
know the provenance of the original data, but even more to make
the products of these applications reproducible and therefore more
reliable [1].

Although the choice of metrics we adopted can influence the 
extent of the assessments on the use of the dataset, they provide 
significant indicators to the CDOs to understand if the datasets 
published on the portals they manage attract the interest of the 
users. According to one of them: “We look at the total number of 
datasets that are out there, what we are offering up. We count visit 
clicks, and lastly, we look at how many downloads are actually 
being done off the open data portal” [17]. 

One wonders then why some datasets get more attention
than others, and why in some cases thousands of datasets are completely
‘invisible’ to users. A plausible cause is the
degree of popularity of the thematic domain of each dataset. As
emerged from the cited EU report [3], some
thematic domains (e.g. Government and public sector, Population
and social conditions) are more popular than others (e.g. Health,
Justice and Public safety). We focused on the UK portal and verified
the impact of the thematic domains on the data Views: we analyzed
the thematic domains of the most viewed datasets (i.e. belonging to the
95th percentile) and of the least viewed ones (i.e. belonging to the 25th percentile).
The aim is to understand whether some thematic domains turn out to be
the prerogative of the most viewed datasets. Examining the two
graphs in Figures 7 and 8, this hypothesis seems to apply in particular
to the datasets cataloged under the thematic domains ‘society’
and ‘health’, a significant percentage of which belong to the 95th
percentile, and a smaller one to the 25th percentile. Conversely,
the datasets of the ‘environment’ thematic domain, although they
are among the most present in the 95th percentile, with about 16% of
the most viewed, are also those most present (40%) in the
lower quartile. According to Figure 9, the ‘environment’ datasets
are also the most numerous (i.e. 26%) in the UK portal.
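
The comparison between the thematic domains of the most and least viewed datasets can be sketched in base R as follows. This is only an illustration under stated assumptions: the `datasets` table contains invented values rather than data.gov.uk figures, `theme_share` is a hypothetical helper, and the cut-offs use R's default quantile definition.

```r
# Toy catalogue (invented values): one row per dataset
datasets <- data.frame(
  theme = c("society", "society", "environment", "environment",
            "environment", "health", "transport", "environment"),
  views = c(900, 15, 2, 1, 700, 850, 30, 0),
  stringsAsFactors = FALSE
)

# Cut-offs for the most viewed (95th percentile) and least viewed (25th)
top_cut <- quantile(datasets$views, 0.95)
low_cut <- quantile(datasets$views, 0.25)

top <- datasets[datasets$views >= top_cut, ]
bottom <- datasets[datasets$views <= low_cut, ]

# Percentage of datasets per thematic domain within a subset
theme_share <- function(d) round(100 * prop.table(table(d$theme)), 1)

print(theme_share(top))
print(theme_share(bottom))
print(theme_share(datasets))  # overall composition of the catalogue
```

Printing the overall composition next to the two subsets helps to separate a domain that is genuinely over-represented among the least viewed datasets from one that is simply the most numerous in the catalogue.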

Since membership of a certain thematic domain does not seem
to fully explain why some datasets are more
used than others, other complementary causes must be sought.
Based on the literature, we believe it is worth investigating whether
there is any correlation between the popularity of a dataset and
the quality of its data [22] and metadata [9]. While the quality of
the dataset can only be analyzed when it has been downloaded in
whole or in part, and therefore the fact that it is not re-used is only
an ex-post consequence of its usage, the quality of the metadata can
effectively preclude visibility (to search engines, even inside OGD
portals) and therefore immediate use [8][12]. In addition, users may
be disoriented by the non-adoption of metadata standards or by
their heterogeneity among different portals. To face these issues,
initiatives such as W3C¹², OGC¹³, INSPIRE¹⁴ and FAIR¹⁵ recommend
providing metadata according to existing standards, to "facilitate
interoperability between data catalogues published on Web". In


¹² https://www.w3.org/
¹³ https://www.opengeospatial.org/standards
¹⁴ https://inspire.ec.europa.eu/
¹⁵ https://www.go-fair.org/fair-principles/


[Figure 7: plot omitted; thematic domains on the axis: society, environment, health, towns-and-cities, transport, education, government, mapping, crime-and-justice, defence, business-and-economy, government-spending]

Figure 7: Distribution of the Views for the top (95th percentile)
datasets on data.gov.uk according to thematic domains.
Figure 8: Distribution of the Views for the bottom (25th percentile)
datasets on data.gov.uk according to thematic domains.
Figure 9: Distribution of popularity (according to Views) of
datasets of the UK portal with respect to thematic domains.


particular, DCAT¹⁶ has recently been designed to improve the
interoperability of data catalogues and to allow applications to easily
consume metadata from multiple catalogues¹⁷.


4.2 Scarcity of usage data 


A second critical aspect that emerged from our analysis concerns
the rarity or non-availability of usage information. As stated in
Sections 2.1 and 3.1, in deciding which OGD portals to include in our
analysis we realized how difficult it is to find any usage information,
either direct or indirect, in most of the main OGD portals managed
at national or international level. In addition, in those few portals
where this information is made available, the usage indicators are
not always all present (see Table 1). Moreover, even in the portals in
which we found usage metadata available directly to the users,
programmatic access returns only partial or empty metadata,
as highlighted for the French, Portuguese and UK portals. Hence
the difficulty of obtaining a homogeneous synthesis able to provide
a systematic picture of the use of the different portals. This lack
would seem to imply that CDOs underestimate the importance of
informing users about the popularity of their datasets. As observed
by Sasse [16], however, usage indicators such as Views and Downloads
would have the potential effect of drawing users' attention to the
datasets published on their portal, instead of those available on
competing portals. In short, this ‘popularity’ information could
work similarly to that used to attract users/customers to a social
media or web economy platform and be collected by government

¹⁶ https://www.w3.org/TR/vocab-dcat/
¹⁷ http://devinit.org/wp-content/uploads/2018/01/Metadata-for-open-data-portals.pdf
portals to improve customer service [5]. The fact that this does
not happen in most of the OGD portals makes us suspect that, in
such cases, revealing that “no one visits it”, as was the case for many of
the datasets examined, would constitute a “boomerang effect” that
the managers of the portals prefer to avoid.


5 CONCLUSIONS AND FUTURE WORKS 


The paper provides a preliminary overview of the use of OGD
portals. It contributes i) to outlining a common under-use of most
of the datasets of OGD portals; ii) to highlighting the lack of usage
(meta)data on the datasets provided by the portals themselves.

Regarding the first point, from a set of 98 national OGD portals
around the world, we have selected a subset of five which provide
direct (metadata) or indirect (via programmatic API)
access to the usage metrics (i.e. Views and Downloads) in each portal.
The results show that the frequencies of use follow a long-tailed
distribution for all the portals analyzed. From a first investigation,
it seems that the reason why users so markedly prefer a
few datasets is not immediately attributable to their belonging to
a particular thematic domain. We advance instead the hypothesis
that this is in some way related to the quality of their metadata.

As for the second point, we have given some insights into practices
of publication of dataset usage metadata by portal operators,
noting that usage metric values such as the numbers of views and
downloads are not easily accessible to users or are missing from most
OGD portals. As observed, this makes it less immediate for users
to evaluate the reception of a dataset of interest to them.

In our future work, we will analyze the potential correlation between
the popularity of a dataset and the quality of its data/metadata,
considering that, as recommended by the literature, it is advisable
to publish good quality data and metadata to improve user
understanding and, therefore, increase the use of the associated
datasets.
ABSTRACT


The data posting framework introduced in [8] adapts the well-known
Data Exchange techniques to the new Big Data management and
analysis challenges that can be found in real world scenarios. Although
it is expressive enough, it requires the ability to use count
constraints and may be difficult for a non-expert user. Moreover,
the data posting problem is NP-complete under data complexity
in the general case, where non-deterministic variables
are used. Thus, identifying the conditions that guarantee
polynomial-time execution in the presence of non-deterministic
choices is very important for practical purposes. In this paper, we
present a simplified version of the data posting framework, based on the
use of smart mapping rules, which integrate the simple mapping
description with some parameters, avoiding complex specifications
with count constraints. We show that the data posting problem
in the new setting is NP-complete and identify the conditions under
which this problem becomes polynomial even in the presence of
non-deterministic choices.
1 INTRODUCTION


The Big Data paradigm [1, 32, 33] has recently come on the scene in a quite
pervasive manner; however, this apparently sudden change of perspective
had a long history before the term Big Data was coined.
Indeed, both industry and research people have been entrenched in
(big) data that have been stored in massive amounts, with
increasing speed and exhibiting a huge variety, for over a decade before
the Big Data paradigm was officially born. The major challenge has
always been to unveil valuable insights for the industry to which
these particular data belonged.

As a matter of fact, sifting through all of these data, parsing them,
transferring them from a source to a target database, and analyzing
all of them in order to improve business decision-making
processes turns out to be too complex for traditional approaches. In the
presence of incomplete databases, certain answers are a principled
semantics of query answering [10, 15]. Since the computation of
certain query answers is a coNP-hard problem, recent research has focused
on developing polynomial time algorithms computing a sound
(but possibly incomplete) set of certain answers [12, 16, 17, 22, 23].
Approximation algorithms offer a possible solution when detailed
information is required. However, in the Big Data scenario we are
often interested in succinct information and in the discovery of new
knowledge. To address these issues, some proposals have been made,
such as the Data Posting framework [8]. One of the most important
features of Data Posting is the enrichment of data while exchanging them
between the sources and the target database. Intuitively, the Data
Posting setting consists of source and domain database schemes,
a target flat fact table, a set of source-to-target mapping rules and a
set of target constraints. The data posting problem associated with
this setting is: given finite source and domain database instances,
find a finite instance for the target fact table that satisfies the internal
integrity constraints and the mapping requirements.

The problem of the finiteness of the target database is well known in
the context of Data Exchange. The presence of existential quantifiers
in the mapping rules and their replacement with null values can create
situations in which the finiteness property of the target database
might not be satisfied. The Data Posting approach uses non-deterministic
variables instead of the existentially quantified ones in the mapping
rules (the so-called Source-to-Target Generating Dependencies).
The values for the non-deterministic variables can be chosen
non-deterministically from the finite domain relation following a strategy
that leverages count constraints [31]. Obviously, the solution to the
data posting problem may not be universal, as it represents a specific
choice. However, as mentioned before, in the context of Big Data
we are often interested in the discovery of new knowledge and the
overall analysis of the data; moreover, some attributes of the target
tables may be created for storing the discovered values. Thus, the
choice of actual values can be seen as a first phase of data analysis
that resolves uncertainties by enriching the information content of the
whole system. Consider the following application scenario, which we
will use as a running example.


EXAMPLE 1. The databases S1 and S2 describe user
profiles represented by the relations P1(I, N, V) and P2(I, N, V), respectively,
with attributes I (profile identifier), N (attribute name) and V
(attribute value). The problem is to enrich the user profiles from P1
with some “relevant” attributes from P2.

The scenario described in the example above requires the solution
of two main problems:


(1) extract the information about profile compatibility from the
relations P1 and P2;

(2) identify "relevant" attributes and their values for each profile
in P1.


The first task can be considered as a kind of soft clustering that aims
at grouping similar users in order to classify them for further analysis,
e.g., mail classification [24], trajectory grouping [25], biological data
analysis [26]. The second one requires the definition of the choice
strategy based on the user/designer experience. In particular, the
choice of the name-value combination has to take into account two
different needs: 1) deciding whether the attribute has to be added to the
profile, 2) selecting the value of this attribute, as in classical data
warehousing [11]. One possible criterion and its expression with the
standard data posting constructs are described below.


Example 1 (continued). Suppose that we extract the information
about profile compatibility and store it in a relation C(I1, I2, L).
In particular, the first two attributes contain profile identifiers from
tables P1 and P2, respectively, whereas L represents the level of
compatibility of these profiles. In order to enrich P1 with some "relevant"
attributes from P2 we can set the following strategy: a name-value
attribute combination (n2, v2) taken from P2 is “relevant” to the
profile with identifier i1 described in the relation P1 if the following
conditions hold: 1) it is sufficiently supported, i.e. it is supported
by at least 10 profiles from P2 with a percentage of compatibility
towards i1 of at least 50%; 2) if different values corresponding to the same
attribute are sufficiently supported, only the one with the greatest
support value is "relevant" and will be added to i1.
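
The selection strategy just described can be read operationally as a join, a support count and a choice of the best-supported value. The following base R sketch is only an illustration under stated assumptions, not the data posting machinery itself: `select_relevant`, the toy relations and the thresholds passed in the usage line are hypothetical, and ties between equally supported values are broken by keeping the first one in sort order, which is an arbitrary choice made here for determinism.

```r
# Toy relations (invented values): P2(I, N, V) and C(I1, I2, L)
P2 <- data.frame(
  I = c("b1", "b2", "b3", "b1", "b2", "b3"),
  N = c("city", "city", "city", "job", "job", "job"),
  V = c("Rome", "Rome", "Milan", "clerk", "nurse", "pilot"),
  stringsAsFactors = FALSE
)
C <- data.frame(
  I1 = c("a1", "a1", "a1"),
  I2 = c("b1", "b2", "b3"),
  L = c(0.9, 0.6, 0.7),
  stringsAsFactors = FALSE
)

# Hypothetical helper implementing the strategy of Example 1
select_relevant <- function(P2, C, min_level = 0.5, min_support = 10) {
  # A(I1, I2, N, V): attributes of P2 profiles compatible enough with I1
  A <- merge(C[C$L >= min_level, ], P2, by.x = "I2", by.y = "I")
  A <- unique(A[, c("I1", "I2", "N", "V")])
  if (nrow(A) == 0) {
    return(data.frame(I1 = character(0), N = character(0),
                      V = character(0), support = integer(0)))
  }
  # support = number of distinct P2 profiles backing each (I1, N, V)
  A$support <- 1L
  sup <- aggregate(support ~ I1 + N + V, data = A, FUN = sum)
  sup <- sup[sup$support >= min_support, ]
  # for each (I1, N) keep the value with the greatest support
  sup <- sup[order(sup$I1, sup$N, -sup$support), ]
  sup[!duplicated(sup[, c("I1", "N")]), ]
}

# The example's thresholds are 50% and 10; 2 is used here to fit the toy data
relevant <- select_relevant(P2, C, min_support = 2)
print(relevant)
```

On the toy data only the pair (city, Rome) reaches the reduced support threshold, so it is the only combination proposed for profile a1; with the default threshold of 10 nothing is selected.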

The description of this scenario with the standard data posting
constructs can be done as follows. We define a domain relation D
containing only the values 0, 1, and -1. The target relations are:


• A(I1, I2, N2, V2) stores the information of the profiles from
P2 whose compatibility level with some profile in P1 is at
least 50%.

• Add(I1, N2, V2, Flag) stores the name-value combinations
(N2, V2) taken from P2 and the decision Flag to add this
pair to the profile I1. The values of the attribute Flag have
the following meaning:

-1 the combination (N2, V2) is not added, because it is not
sufficiently supported;

0 the combination (N2, V2) is not added: although it is sufficiently
supported, the value V2 is not selected following
the preference specification;

1 the combination (N2, V2) is added, as it is sufficiently supported
and the value V2 is selected following the preference
specification.


The source-to-target dependencies are

\[ P_2(i_2, n_2, v_2) \wedge C(i_1, i_2, l) \wedge l \geq 0.50 \rightarrow A(i_1, i_2, n_2, v_2) \]

\[ P_2(i_2, n_2, v_2) \wedge C(i_1, i_2, l) \wedge l \geq 0.50 \wedge D(flag) \rightarrow Add(i_1, n_2, v_2, flag) \]


where all variables are universally quantified. The fact that D is a
domain relation, together with its presence in the body of the second
dependency, states that only one value among -1, 0 and 1 can be chosen as the
flag value for each triple (i1, n2, v2) in the relation Add.
The following count constraints set the selection criteria:


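\[ Add(i_1, n_2, v_2, 1) \rightarrow \#(\{ I_2 : A(i_1, I_2, n_2, v_2) \}) \geq 10 \]
\[ Add(i_1, n_2, v_2, 0) \rightarrow \#(\{ I_2 : A(i_1, I_2, n_2, v_2) \}) \geq 10 \]
\[ Add(i_1, n_2, v_2, -1) \rightarrow \#(\{ I_2 : A(i_1, I_2, n_2, v_2) \}) < 10 \]
\[ Add(i_1, n_2, \_, \_) \rightarrow \#(\{ V : Add(i_1, n_2, V, 1) \}) \leq 1 \]
\[ Add(i_1, n_2, v_2, 1), Add(i_1, n_2, v, 0) \rightarrow \#(\{ I_2 : A(i_1, I_2, n_2, v_2) \}) > \#(\{ I_2 : A(i_1, I_2, n_2, v) \}) \]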


The data posting setting is expressive enough; however, as shown
in the example above, it requires the ability to use count constraints
and may be difficult for a non-expert user. Moreover, as shown in
[30], the data posting problem is NP-complete
under data complexity in the general framework, where
non-deterministic variables are used. In the absence of
non-deterministic variables, the problem becomes polynomial, but
this condition is too restrictive in practice. Thus, identifying the
conditions that guarantee polynomial-time execution in the presence
of non-deterministic choices is very important for practical purposes.

Recently, in [27], the use of smart mapping rules to support user
suggestions in a big data environment has been proposed. In this paper
we further investigate this idea and present a simplified version of
the data posting framework, called Smart Data Posting. We show that
the data posting problem in the new setting is NP-complete and
identify the conditions under which this problem becomes polynomial
even in the presence of non-deterministic choices.

The model of our running scenario in the new framework is 
described below. 


Example 1 (continued). Our running scenario can be modelled as
follows.


• $P_2$ and $C$ are source relations;

• $Add(I_1, N_2, V_2)$ is a target relation, which specifies the tuples
to be added to the other target relation $P_1$;

• the smart mapping rule is reported below:


$$P_2(i_2, n_2, v_2) \wedge C(i_1, i_2, l) \wedge l \geq 0.50 \xrightarrow{\ i_2,\ 10,\ \langle v_2,\ unique \wedge max \rangle\ } Add(i_1, n_2, v_2)$$


Intuitively, the body of the rule restricts attention to the
profiles with at least 50% compatibility. The selection criterion is
summarized on the arrow, which indicates 1) the support variable
($i_2$), 2) the minimum number (10) of support instances required for
the mapping, and 3) the variable ($v_2$) whose value should be chosen
from the set of candidate values, together with the preference for
this choice ($unique \wedge max$).


Note that the use of the smart mapping rule makes the model
of the described scenario simpler and more intuitive. Indeed, this
representation shows that the analyst can build the local selection
criterion by focusing only on a small set of parameters. The selection
criterion that can be represented by smart mapping rules is a
particular parametric combination of aggregation, counting and
selection operations. The standardization of its representation
allows us to optimize the implementation of the data posting process.

Plan of the paper. In Section 2 we describe the background
of our approach. In Section 3 we present the Smart Data Posting
framework. In Section 4 we perform a complexity analysis of the
framework. Finally, in Section 5 we draw our conclusions.


2 BACKGROUND 


Data Exchange. A schema is a finite collection $R = \{R_1, \ldots, R_k\}$
of relation symbols. Each relation symbol has an arity, which is a
positive integer. A relation symbol of arity $n$ is called $n$-ary, and
has $n$ distinct attributes, which intuitively correspond to column
names. An instance $I$ over the schema $R$ is a function that associates
to each $n$-ary relation symbol $R_i$ an $n$-ary relation $I(R_i)$. With a
little abuse of notation we will use $R_i$ to denote both the relation
symbol and the relation that interprets it. Given a tuple $t$ occurring
in a relation $R$, we denote by $R(t)$ the association between $t$ and $R$
and call it a fact. An instance can be conveniently represented by its
set of facts. $R(v)$, where $v$ is a vector of variables or constants with
the arity of $R$, is called an atom. If $R$ is a schema, then a dependency
over $R$ is a sentence in some logical formalism over $R$.

A Tuple Generating Dependency (TGD) is a formula of the form:


$$\forall x\ \phi(x) \rightarrow \exists y\ \psi(x, y)$$


where $\phi(x)$ and $\psi(x, y)$ are conjunctions of atoms, and $x$, $y$ are
lists of variables.
Full TGDs are TGDs without existentially quantified variables.


An Equality Generating Dependency (EGD) is a formula of the form:

$$\forall x\ \phi(x) \rightarrow x_1 = x_2$$


where $\phi(x)$ is a conjunction of atoms, while $x_1$ and $x_2$ are
variables in $x$. In our formulae it is common to omit the universal
quantifiers when their presence is clear from the context. The
left-hand side (w.r.t. the implication symbol) of a data dependency is
called the body, whereas the right-hand side is called the head.

Let $S = \{S_1, \ldots, S_n\}$ and $T = \{T_1, \ldots, T_m\}$ be two disjoint
schemas. We refer to $S$ as the source schema and to the $S_i$'s as the
source relation symbols. We refer to $T$ as the target schema and to
the $T_j$'s as the target relation symbols. Similarly, instances over $S$
will be called source instances, while instances over $T$ will be
called target instances. If $I$ is a source instance and $J$ is a target
instance, then we write $(I, J)$ for the instance $K$ over the schema
$S \cup T$ such that $K(S_i) = I(S_i)$ and $K(T_j) = J(T_j)$, for $i \leq n$
and $j \leq m$.

The data exchange setting [2, 10] is a tuple $(S, T, \Sigma_{ST}, \Sigma_T)$,
where $S$ is the source relational database schema, $T$ is the target
schema, $\Sigma_T$ are dependencies over the target schema $T$ and
$\Sigma_{ST}$ are source-to-target TGDs. The dependencies in $\Sigma_{ST}$
map data from the source to the target schema and are TGDs of the form


$$\forall x\ (\phi_S(x) \rightarrow \exists y\ \psi_T(x, y))$$


where $\phi_S(x)$ and $\psi_T(x, y)$ are conjunctions of atomic formulas
on $S$ and $T$, respectively. Dependencies in $\Sigma_{ST}$ are also called
mapping dependencies. Dependencies in $\Sigma_T$ specify constraints on
the target schema and can be either TGDs or EGDs.

The data exchange problem associated with this setting is the
following: given a finite source instance $I$, find a finite target
instance $J$ such that $(I, J)$ satisfies $\Sigma_{ST}$ and $J$ satisfies
$\Sigma_T$. Such a $J$ is called a solution for $I$.

A universal solution (the compact representation of all possible
solutions) can be computed by means of the fixpoint chase algorithm,
when it terminates [9]. The execution of the chase involves inserting
tuples, possibly with null values, to satisfy TGDs, and replacing null
values with constants or other null values to satisfy EGDs.
Specifically, the chase consists of applying a sequence of steps,
where each step enforces a dependency that is not satisfied by the
current instance. It might well be the case that multiple dependencies
can be enforced and, in this case, the chase picks one
nondeterministically. Different choices lead to different sequences,
some of which might be terminating, while others might not.
Unfortunately, checking whether the chase terminates is an undecidable
problem [9]. To cope with this issue, several "termination criteria"
have been proposed, that is, (decidable) sufficient conditions
ensuring chase termination. Some recent works can be found in
[3, 4, 19, 20, 28, 29]; a tool for checking chase termination has
been described in [13].
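
As an illustration of the step-by-step nature of the chase, the following minimal sketch enforces a set of full TGDs until no dependency is violated. It is deliberately restricted to full TGDs (no existential variables, no EGDs), for which termination on a finite instance is guaranteed. The encoding of atoms as `(relation, arguments)` pairs, the `?` prefix for variables, and the names `is_var`, `matches` and `chase_full_tgds` are assumptions made for this example and do not come from [9] or [13].

```python
def is_var(term):
    # Assumed convention: variables are strings that start with "?".
    return isinstance(term, str) and term.startswith("?")


def matches(atoms, facts, binding=None):
    """Yield every variable binding that maps all atoms onto facts."""
    binding = binding or {}
    if not atoms:
        yield binding
        return
    (rel, args), rest = atoms[0], atoms[1:]
    for fact_rel, fact_args in facts:
        if fact_rel != rel or len(fact_args) != len(args):
            continue
        new = dict(binding)
        ok = True
        for a, c in zip(args, fact_args):
            if is_var(a):
                if new.setdefault(a, c) != c:   # clashes with an earlier binding
                    ok = False
                    break
            elif a != c:                        # constant mismatch
                ok = False
                break
        if ok:
            yield from matches(rest, facts, new)


def chase_full_tgds(facts, tgds):
    """Enforce full TGDs, given as (body, head) pairs, up to a fixpoint."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in tgds:
            # Materialize the bindings first, then add the missing head facts.
            for b in list(matches(body, facts)):
                for rel, args in head:
                    fact = (rel, tuple(b.get(a, a) for a in args))
                    if fact not in facts:
                        facts.add(fact)
                        changed = True
    return facts


# Toy instance: copy arcs into paths and close paths under composition.
source = {("arc", ("a", "b")), ("arc", ("b", "c"))}
tgds = [
    ([("arc", ("?x", "?y"))], [("path", ("?x", "?y"))]),
    ([("path", ("?x", "?y")), ("arc", ("?y", "?z"))], [("path", ("?x", "?z"))]),
]
result = chase_full_tgds(source, tgds)
```

With existential variables each step could introduce fresh nulls, which is where the termination criteria cited above become necessary.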


Data posting. Unlike the standard Data Exchange approach, Data
Posting [30] relies on more expressive constraints to enrich the
contents of the exchanged data. We start from the definition of the
database schemata involved.

Let $S = (S_1, \ldots, S_n)$, $D = (D_1, \ldots, D_m)$, and
$T = (T_1, \ldots, T_q)$ be three disjoint schemas. We refer to $S$ (resp.
$D$, $T$) as the source (resp. domain, target) schema and to the $S_i$'s
(resp. $D_j$'s, $T_g$'s) as the source (resp. domain, target) relation
symbols. We assume that all instances over $S$ and $D$ are finite. As will
be shown in this section, any target instance over $T$ is finite as
well, given the structure of the mapping constraints defined below.

A non-deterministic source-to-target TGD is a dependency over
$(S, D, T)$ of the form


$$\forall x\ [\, \phi_S(x \cup y) \rightarrow \phi_T(z) \,]$$


where $x$ and $z$ are lists of universally quantified variables; $y$ is a
(possibly empty) list of variables, called non-deterministic, which
can occur in $\phi_S$ only in relations from $D$; $x \cap y = \emptyset$ and
$z \subseteq x \cup y$; the formulas $\phi_S$ and $\phi_T$ are conjunctions
of atoms with predicate symbols in $S \cup D$ and in $T$, respectively.

Non-deterministic TGDs can be seen as standard TGDs where the
existentially quantified variables are replaced with non-deterministic
variables, whose values can be chosen from the finite domains defined
by the domain relations. The mapping process is performed as usual,
but presumes that for every assignment of $x$ a subset of all admissible
values for $y$ can be chosen in an arbitrary way. Every possible choice
is called a non-deterministic domain mapping.

Let $I = (I_S, I_D)$ be given, where $I_S$ and $I_D$ are finite source
instances for $S$ and for $D$, respectively. The active domain $AD_I$
is the set of all values occurring in $I_S$ and $I_D$. Let an admissible
instance $I_T$ for $T$ also be given, that is, an instance whose values
all occur in $AD_I$. The semantics of a non-deterministic TGD $t$ states
whether $t$ is satisfied or not by $(I_S, I_D, I_T)$. The notion of
satisfaction is introduced after first fixing one of the possible
non-deterministic domain mappings, say $f_t$.

We say that $(I_S, I_D, I_T)$ satisfies $t$ w.r.t. $f_t$ if for each
$X \in (AD_I)^{|x|}$ and for each $Y \in f_t(X)$: either
$\phi_S(x \cup y)[x/X, y/Y]$ is made false by $(I_S, I_D)$ or
$\phi_T(z)[z/(X \cup Y)_z]$ is made true by $I_T$, where the substitution
$[x/X, y/Y]$ assigns the values $X$ and $Y$ to the corresponding
variables in $x$ and $y$, respectively, in the formula $\phi_S$, and it
induces a substitution, denoted by $[z/(X \cup Y)_z]$, for the variables
of $z$ in the formula $\phi_T$ as well, since $z \subseteq x \cup y$ by
definition.

Given a set $\Sigma$ of non-deterministic source-to-target TGD
constraints and finite instances $I_S$ for $S$, $I_D$ for $D$ and $I_T$ for
$T$, $(I_S, I_D, I_T)$ satisfies $\Sigma$ if for each $t \in \Sigma$ there
exists a non-deterministic domain mapping $f_t$ such that
$(I_S, I_D, I_T)$ satisfies $t$ w.r.t. $f_t$.

As an example, consider the source relation $object_S$ and the
domain relation $D$ reporting all possible characterizations of
objects, whose instances are $I_S = \{r\}$, where $r$ denotes a restaurant,
and $I_D = \{(r, fish), (r, meat), (r, expensive), (r, cheap)\}$. The
following non-deterministic TGD can be used to assign
characterizations to the objects, choosing them from the domain
relation $D$ non-deterministically.


$$object_S(n) \wedge D(n, v) \rightarrow description_T(n, v)$$


This constraint is satisfied by $(I_S, I_D, I_{T_1})$, where
$I_{T_1} = \{(r, fish), (r, expensive)\}$, and is not satisfied by
$(I_S, I_D, I_{T_2})$, where $I_{T_2} = \{(r, green)\}$.

In the case of empty $y$, the non-deterministic TGD corresponds
to a full TGD and its semantics corresponds to the standard one. For
instance,

$$object_S(n) \rightarrow object_T(n)$$

simply creates a copy of the source relation $object_S$.


A count constraint is a dependency over $T$ of the form


$$\forall x\ [\, \phi_T(x) \rightarrow \#(\{ y : \exists z\ \alpha(x, y, z) \})\ \langle op \rangle\ \beta(x) \,]$$


where $\phi_T$ is a conjunction of atoms with predicate symbols in $T$,
$\langle op \rangle$ is any of the comparison operators ($=$, $>$, $\geq$,
$<$ and $\leq$), $H = \{ y : \exists z\ \alpha(x, y, z) \}$ is a set term,
$\#$ is an interpreted function symbol that computes the cardinality of
the (possibly empty) set corresponding to $H$, $\#(H)$ is a count term,
and $\beta(x)$ is an integer, a variable in $x$, or another count term
with universally quantified variables in $x$. The two lists $y$ and $z$
consist of distinct variables that are also different from the
universally quantified variables in $x$; $\alpha(x, y, z)$ is a
conjunction of atoms $T_i(x, y, z)$ with $T_i \in T$.

To define the semantics of a count constraint, we assume that an
instance $I_T$ for $T$ is given. Then, we consider the active domain
$AD_I$ as the set of all values occurring in $I_T$. Given a substitution
$x/X$ assigning values in $AD_I$ to the universally quantified
variables, $K_X = \{ y : \exists z\ \alpha(X, y, z) \}$ defines the set of
values in $AD_I$ assigned to the free variables in $y$ for which
$\exists z\ \alpha(X, y, z)$ is satisfied by $I_T$, and $\#(K_X)$ is the
cardinality of this set. We say that $I_T$ satisfies


$$\forall x\ [\, \phi_T(x) \rightarrow \#(\{ y : \exists z\ \alpha(x, y, z) \})\ \langle op \rangle\ \beta(x) \,]$$


if each substitution $x/X$ that makes $\phi_T(x)$ true also makes the
head expression $\#(\{ y : \exists z\ \alpha(x, y, z) \})\ \langle op \rangle\ \beta(x)$ true.
As an example, the count constraint


$$object_T(n) \rightarrow \#(\{ V : description_T(n, V) \}) = 2$$


states that every object must have exactly 2 characterizations.
Observe that target count constraints extend both TGDs and EGDs
of the classical data exchange setting.
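
The example constraint can be checked on a concrete target instance with a few lines of code. The sketch below is only an illustration: the function name `satisfies_count`, the representation of relations as Python sets of tuples, and the use of the `operator` module for the comparison operator are assumptions of this example.

```python
import operator
from collections import defaultdict


def satisfies_count(objects, descriptions, op=operator.eq, bound=2):
    """Check object_T(n) -> #({V : description_T(n, V)}) <op> bound.

    objects      : set of values n appearing in object_T
    descriptions : set of (n, v) pairs appearing in description_T
    """
    values = defaultdict(set)
    for n, v in descriptions:
        values[n].add(v)          # a set term: each value is counted once
    # The head must hold for every substitution that makes the body true.
    return all(op(len(values[n]), bound) for n in objects)


objects_t = {"r"}
exactly_two = satisfies_count(objects_t, {("r", "fish"), ("r", "expensive")})
only_one = satisfies_count(objects_t, {("r", "fish")})
at_least_one = satisfies_count(objects_t, {("r", "fish")}, operator.ge, 1)
```

Here `exactly_two` is true and `only_one` is false, while relaxing the operator to "at least 1" makes the second instance acceptable.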


We are now ready to formulate the data posting problem.
The data posting setting $(S, D, T, \Sigma_{ST}, \Sigma_T)$ consists of a
source database schema $S$, a domain database schema $D$, a target flat
fact table $T$, a set $\Sigma_{ST}$ of source-to-target TGDs and a set
$\Sigma_T$ of target count constraints.

The data posting problem associated with this setting is: given
finite source instances $I_S$ for $S$ and $I_D$ for $D$, find a finite
instance $I_T$ for $T$ such that $(I_S, I_D, I_T)$ satisfies both
$\Sigma_{ST}$ and $\Sigma_T$. This problem is NP-complete under data
complexity. Obviously, when $\Sigma_{ST}$ is composed only of full TGDs,
the problem becomes polynomial.


3 SMART DATA POSTING 


The Smart Data Posting setting is based on the idea that the standard
source-to-target dependencies can be enriched with the selection
criterion regarding the local exchange process. The resulting
dependencies, which we call smart mapping rules, are expressive enough
for different practical applications and can be profitably used for
simplifying and optimizing the standard Data Posting setting.


DEFINITION 1. A smart mapping rule can be defined as follows:


$$\forall z\ [\, \phi_S(z) \xrightarrow{\ y,\ k,\ \langle v, f \rangle\ } r(x, v) \,]$$


where $x$, $y$, $z$, $v$ are vectors of variables, such that
$x \cup y \cup v \subseteq z$ and $x$, $y$ and $v$ do not share variables;
$\phi_S$ is the conjunction of literals and expressions involving
comparison operators ($>$, $<$, $\geq$, $\leq$, $=$, $\neq$) and variables
in $z$ or constants; $r$ is a target relation; $y$ is called a support
vector; $k$ is a natural number (greater than 1) which indicates the
support value; $y$ and $k$ may be omitted together; the pair
$\langle v, f \rangle$ indicates how the choice for the values of $v$
should be performed: $f$ can be $max$, $unique$, or the conjunction
$unique \wedge max$; the pair $\langle v, f \rangle$ may be omitted.


Semantics. The smart mapping rule specifies that the tuple $(X, V)$ is
added to $r$ only if it is supported by at least $k$ (different)
initializations $(Y_1, \ldots, Y_k)$ of $y$, i.e. for each $j \in [1..k]$
there exists an initialization $Z_j$ of $z$ that maps $x$, $y$ and $v$ to
$X$, $Y_j$ and $V$, respectively, and that makes $\phi_S(Z_j)$ true. If both
$y$ and $k$ are omitted, all initializations satisfying the body pass
this first check.

If no further indication of choice is specified (the third arrow
label is omitted), all the tuples passing the first check are added
to $r$. Otherwise, the set of tuples to be added is further reduced
using $f$ for the selection of values in $v$.

The statement $\langle v, unique \rangle$ specifies that the tuples
transported into the relation $r$ must obey the functional dependency
$x \rightarrow v$, i.e. for each assignment of values in $x$ the
assignment of values in $v$ must be unique. If several tuples are
supported by at least $k$ (different) initializations of $y$ and they
have the same values in $x$, the choice can be made arbitrarily.

The statement $\langle v, max \rangle$ specifies that, for each $X$, only
tuples supported by a maximum number of initializations of $y$ must be
selected. It is easy to see that this constraint does not guarantee the
uniqueness of the choice. For example, it is possible that two tuples
$(X, V_1)$ and $(X, V_2)$ have the same degree of support, corresponding
to the maximum value.

The statement $\langle v, unique \wedge max \rangle$ specifies that, for a
fixed $X$, only one tuple $(X, V)$ can be chosen among those supported by
a maximum number of (different) initializations of $y$.
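
The semantics above can be summarized operationally: count the distinct support values per $(X, V)$, apply the threshold $k$, then apply $f$. The following sketch is one possible reading of that procedure, not the authors' implementation. The function name `apply_smart_rule`, the string encoding of $f$ (`"max"`, `"unique"`, `"unique_max"`) and the assumption that the body has already been evaluated into `(X, Y, V)` triples are all choices made for this illustration.

```python
from collections import defaultdict


def apply_smart_rule(matches, k=None, f=None):
    """Return the set of (X, V) tuples to be added to the target relation r.

    matches : iterable of (X, Y, V) triples, one per initialization of z
              that satisfies the body (X, Y, V must be hashable)
    k       : minimum number of distinct Y values, or None to skip the check
    f       : None, "max", "unique" or "unique_max"
    """
    supporters = defaultdict(set)
    for x, y, v in matches:
        supporters[(x, v)].add(y)
    support = {xv: len(ys) for xv, ys in supporters.items()}

    # First check: keep tuples supported by at least k distinct Y values.
    candidates = defaultdict(list)
    for (x, v), s in support.items():
        if k is None or s >= k:
            candidates[x].append(v)

    selected = set()
    for x, values in candidates.items():
        if f in ("max", "unique_max"):
            best = max(support[(x, v)] for v in values)
            values = [v for v in values if support[(x, v)] == best]
        if f in ("unique", "unique_max"):
            # Any single value is admissible; sorting only makes the
            # arbitrary choice repeatable in this sketch.
            values = sorted(values)[:1]
        selected.update((x, v) for v in values)
    return selected


# Toy data in the spirit of Example 1: X = (i1, n2), Y = i2, V = v2.
matches = [(("p1", "city"), f"q{j}", "Rome") for j in range(3)]
matches += [(("p1", "city"), f"q{j}", "Milan") for j in range(3, 5)]
matches += [(("p1", "job"), "q0", "chef")]

supported = apply_smart_rule(matches, k=2)
best_only = apply_smart_rule(matches, k=2, f="unique_max")
```

With these toy values (and $k = 2$ instead of the 10 used in the running example, purely for brevity), `supported` keeps both city values, while `best_only` keeps only the pair with value "Rome", which has three supporters against two. Every step is a grouping, counting or selection operation over the body matches.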



EXAMPLE 2. Consider again our running Example 1. Below
we report some selection strategies and the corresponding smart
mapping rules.

(1) A name-value description $(n_2, v_2)$ of an attribute stored in
$S_2$ is added to the profile $I_1$ only if it is "supported" by at least
10 profiles of the source $S_2$ with a compatibility percentage
towards $I_1$ of at least 50%. In the case of different combinations
characterized by the same name but having different values, the
"most supported" ones are chosen (there can be two or more
combinations supported by the same number of profiles in $S_2$).


$$P_2(i_2, n_2, v_2) \wedge C(i_1, i_2, l) \wedge l \geq 0.50 \xrightarrow{\ i_2,\ 10,\ \langle v_2,\ max \rangle\ } Add(i_1, n_2, v_2)$$


(2) All name-value combinations $(n_2, v_2)$ "supported" by at least
100 profiles belonging to source $S_2$ with a percentage of
compatibility towards profile $I_1$ of at least 70% must be added to
profile $I_1$.

$$P_2(i_2, n_2, v_2) \wedge C(i_1, i_2, l) \wedge l \geq 0.70 \xrightarrow{\ i_2,\ 100\ } Add(i_1, n_2, v_2)$$

(3) The attributes to be added to the profile $I_1$ are those present in
the profiles of the source $S_2$ with a percentage of compatibility
towards $I_1$ of at least 40%. If there are different combinations
characterized by the same name having different values, only
one combination is chosen arbitrarily.


$$P_2(i_2, n_2, v_2) \wedge C(i_1, i_2, l) \wedge l \geq 0.40 \xrightarrow{\ \langle v_2,\ unique \rangle\ } Add(i_1, n_2, v_2)$$


We will call a smart mapping rule non-trivial if the arrow has 
at least one label, and trivial otherwise. Obviously, trivial mapping 
rules correspond to full TGDs. 


DEFINITION 2. The Smart Data Posting setting $(S, T, \Sigma, \Sigma_T)$
consists of a source and a target database schema $S$ and $T$, a set
$\Sigma$ of smart mapping rules, and a set $\Sigma_T$ of target
constraints. Smart mapping rules in $\Sigma$ are of the form

$$\forall z\ [\, \phi_S(z) \xrightarrow{\ y,\ k,\ \langle v, f \rangle\ } r(x, v) \,],$$

where $\phi_S$ denotes a conjunction of source relations. Each target
relation defined by a non-trivial mapping rule is defined by this
rule only. $\Sigma_T$ is composed of count constraints involving only
target relations.

The data posting problem associated with this setting is: given a
finite source instance $I_S$ for $S$, find a finite instance $I_T$ of $T$
such that $(I_S, I_T)$ satisfies $\Sigma \cup \Sigma_T$.


Semantics. The semantics of the Smart Data Posting setting can be
given in terms of the standard Data Posting setting. In particular, the
standard Data Posting setting corresponding to a given Smart Data
Posting setting can be constructed as follows.

Initially, the source and the target schemas as well as the target
count constraints are taken from the Smart Data Posting setting.
The domain schema contains the unary domain relation schemas $D_1$
and $D_2$. The domain relation $D_1$ contains the values -1 and 1, while
$D_2$ contains the values -1, 0 and 1. The set of source-to-target
constraints is empty.

Next, we translate every non-trivial mapping rule $\rho$ of the form


$$\forall z\ [\, \phi_S(z) \xrightarrow{\ y,\ k,\ \langle v, f \rangle\ } r(x, v) \,]$$

into the constructs of the standard setting.

In the following, if $\langle v, f \rangle$ is omitted, $D_\rho$ denotes
$D_1$, otherwise $D_\rho$ denotes $D_2$. We introduce in the target schema
the target relations $A_\rho(X, Y, V)$ and $Add_\rho(X, V, Flag)$, where
$X$, $Y$ and $V$ represent vectors of attributes corresponding to the
vectors of variables $x$, $y$ and $v$, respectively, while the decision
whether to select the pair $(x, v)$ in the target relation $r$ is stored
in the attribute $Flag$: -1 or 0 (not added) and 1 (added).

The set of source-to-target dependencies is enriched with the
following rules:


$$\phi_S(z) \rightarrow A_\rho(x, y, v)$$
$$\phi_S(z) \wedge D_\rho(flag) \rightarrow Add_\rho(x, v, flag)$$


where all the variables are universally quantified. The use of the
domain relation $D_\rho$ ensures that only one value among -1, 0 and 1
can be chosen for each $(x, v)$ in the relation $Add_\rho$.

Finally, the set of target count constraints is modified as follows.
First, we replace every occurrence of $r(z)$ with $Add_\rho(z, 1)$. Next,
we add the set of constraints that allow us to establish the flag value:


(1) We start by adding support constraints, which ensure that the
value -1 is assigned to the variable $flag$ iff the degree of
support of the combination $(x, v)$ does not reach $k$. First, we
add the constraints for the values 1 and -1.


$$Add_\rho(x, v, 1) \rightarrow \#(\{ Y : A_\rho(x, Y, v) \}) \geq k$$
$$Add_\rho(x, v, -1) \rightarrow \#(\{ Y : A_\rho(x, Y, v) \}) < k$$


If $\langle v, f \rangle$ is not omitted, we also add the constraint for
the value 0:

$$Add_\rho(x, v, 0) \rightarrow \#(\{ Y : A_\rho(x, Y, v) \}) \geq k$$

(2) When $f = unique$ the uniqueness choice constraint is added:

$$Add_\rho(x, \_, flag) \wedge flag \geq 0 \rightarrow \#(\{ V : Add_\rho(x, V, 1) \}) = 1$$


This constraint ensures exactly one initialization of $v$ for
each $X$.

(3) When $f = max$ we add (i) the optimization constraint:


$$Add_\rho(x, v_2, 1),\ Add_\rho(x, v, 0) \rightarrow \#(\{ Y : A_\rho(x, Y, v_2) \}) \geq \#(\{ Y : A_\rho(x, Y, v) \})$$


(ii) the choice constraint, ensuring that at least one
initialization of $v$ for each $X$ is selected:


$$Add_\rho(x, \_, flag) \wedge flag \geq 0 \rightarrow \#(\{ V : Add_\rho(x, V, 1) \}) \geq 1$$


(4) When $f = unique \wedge max$ we add the uniqueness choice
constraint and the optimization constraint.
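
To make the role of the flag concrete, the sketch below checks whether a candidate instance of the two auxiliary relations respects constraints (1)-(4) for a single translated rule. It is an illustrative reading of the translation under stated assumptions: the function name `check_flags`, the string encoding of $f$, and the tuple layouts `(x, y, v)` and `(x, v, flag)` are choices made for this example.

```python
from collections import defaultdict


def check_flags(A, Add, k, f=None):
    """Check the flag constraints of one translated smart mapping rule.

    A   : set of (x, y, v) tuples        (auxiliary relation holding support)
    Add : set of (x, v, flag) tuples     (auxiliary relation holding flags)
    f   : None, "max", "unique" or "unique_max"
    """
    supporters = defaultdict(set)
    for x, y, v in A:
        supporters[(x, v)].add(y)

    def support(x, v):
        return len(supporters.get((x, v), ()))

    flags = defaultdict(set)
    for x, v, flag in Add:
        flags[(x, v)].add(flag)
    # D1 = {-1, 1} when <v, f> is omitted, D2 = {-1, 0, 1} otherwise.
    allowed = {-1, 1} if f is None else {-1, 0, 1}

    chosen, rejected = defaultdict(set), defaultdict(set)
    for (x, v), fs in flags.items():
        if len(fs) != 1 or not fs <= allowed:
            return False                      # one admissible flag per (x, v)
        flag = next(iter(fs))
        if (flag == -1) != (support(x, v) < k):
            return False                      # (1) flag is -1 iff support < k
        if flag == 1:
            chosen[x].add(v)
        elif flag == 0:
            rejected[x].add(v)

    for x in set(chosen) | set(rejected):     # every x with some flag >= 0
        if f in ("unique", "unique_max") and len(chosen[x]) != 1:
            return False                      # (2) exactly one value selected
        if f == "max" and len(chosen[x]) < 1:
            return False                      # (3.ii) at least one selected
        if f in ("max", "unique_max") and any(
            support(x, v1) < support(x, v0)
            for v1 in chosen[x] for v0 in rejected[x]
        ):
            return False                      # (3.i) selected values are best
    return True


A = {("p1", "q1", "Rome"), ("p1", "q2", "Rome"), ("p1", "q1", "Milan")}
good = check_flags(A, {("p1", "Rome", 1), ("p1", "Milan", -1)}, k=2, f="unique_max")
bad = check_flags(A, {("p1", "Rome", -1), ("p1", "Milan", 1)}, k=2, f="unique_max")
```

In the toy instance "Rome" has two supporters and "Milan" one, so with $k = 2$ the first flag assignment is accepted and the second is rejected by the support constraints.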


4 COMPLEXITY RESULTS 


In this section we perform a complexity analysis of the framework. 


THEOREM 1. Given a Smart Data Posting setting $(S, T, \Sigma, \Sigma_T)$
and a finite source instance $I_S$ for $S$, the problem of deciding
whether there exists a finite instance $I_T$ of $T$ such that
$(I_S, I_T)$ satisfies $\Sigma \cup \Sigma_T$ is NP-complete under data
complexity.

Proof. Membership in NP is obvious: it is sufficient to guess an
instance $I_T$ of $T$ and to check whether or not $(I_S, I_T)$ satisfies
$\Sigma \cup \Sigma_T$. Observe that the size of $I_T$ is polynomially
bounded by the input size, as no duplicate tuples are allowed in a
relation. Furthermore, it is easy to see that all constraints on $I_T$
can be checked in deterministic polynomial time.

To prove NP-hardness we next give a reduction from graph
3-coloring, which is well known to be NP-complete. Take any
(undirected) graph $G = (N, A)$, where $N$ is the set of nodes and
$A \subseteq N \times N$ is the set of arcs. We are also given three
colors, say $g$, $r$ and $b$.

We define a source schema $S$ consisting of the relations $node_S(N)$,
$arc_S(N_1, N_2)$, and $color_S(C)$, storing the nodes and arcs of the
graph and the admissible (three) colors, respectively. The target
database schema $T$ contains the relations $arc_T(N_1, N_2)$ and
$cn_T(N, C)$, describing the arcs of the graph and the nodes' colors,
respectively.

The set $\Sigma$ of smart mapping rules is composed of the rules:


$$(1):\ arc_S(n_1, n_2) \rightarrow arc_T(n_1, n_2)$$


$$(2):\ node_S(n) \wedge color_S(c) \xrightarrow{\ \langle c,\ unique \rangle\ } cn_T(n, c)$$


which simply copy the content of the arc relation (rule 1) and assign
a unique color to every node in a non-deterministic way (rule 2).
The set $\Sigma_T$ is composed of the following count constraint:


$$(3):\ arc_T(n_1, n_2),\ cn_T(n_1, c_1),\ cn_T(n_2, c_2) \rightarrow \#(\{ Y : Y = c_1 \wedge Y = c_2 \}) = 0$$


which ensures that the two nodes of an arc have different colors.
It turns out that the data posting problem admits a solution if and
only if the graph is 3-colorable.
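
The correspondence between solutions of the setting and proper colorings can be made tangible with a small brute-force sketch. It enumerates every instance of the node-color target relation allowed by rule (2) and keeps those satisfying constraint (3). The exhaustive search is exponential in the number of nodes and is meant only to illustrate the reduction; the function names `posting_solutions` and `is_3_colorable` are assumptions of this example.

```python
from itertools import product


def posting_solutions(nodes, arcs, colors=("g", "r", "b")):
    """Yield every node-color assignment allowed by rules (1)-(3)."""
    nodes = sorted(nodes)
    # Rule (2) with <c, unique>: each node receives exactly one color,
    # so a candidate target instance is one color per node.
    for choice in product(colors, repeat=len(nodes)):
        cn = dict(zip(nodes, choice))
        # Constraint (3): the endpoints of every arc carry different colors.
        if all(cn[a] != cn[b] for a, b in arcs):
            yield cn


def is_3_colorable(nodes, arcs):
    # The posting problem has a solution iff some assignment survives.
    return next(posting_solutions(nodes, arcs), None) is not None


triangle = is_3_colorable({1, 2, 3}, {(1, 2), (2, 3), (1, 3)})
k4 = is_3_colorable({1, 2, 3, 4},
                    {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)})
```

A triangle admits a solution, whereas the complete graph on four nodes does not, matching the fact that the latter is not 3-colorable.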


The Smart Data Posting setting is called Semi-deterministic if the
non-determinism in the data posting process is locally resolvable,
i.e. if each target relation defined by a mapping rule with a
uniqueness requirement is not involved in $\Sigma_T$.


THEOREM 2. Given a Semi-deterministic Smart Data Posting
setting $(S, T, \Sigma, \Sigma_T)$ and a finite source instance $I_S$ for
$S$, the problem of deciding whether there exists a finite instance
$I_T$ of $T$ such that $(I_S, I_T)$ satisfies $\Sigma \cup \Sigma_T$ is
polynomial under data complexity.


Proof. The application of smart mapping rules requires selection,
projection, join and aggregation operations. In the case of the
uniqueness requirement, one value can be selected arbitrarily. Thus,
the number of tuples that can be added to the target instance is
polynomial in the size of the relations and the domains that occur in
the rule bodies.

Since each target relation defined by a non-trivial mapping
rule is defined by this rule only, its instance is generated following
the indication of this rule. The only case in which the process is
non-deterministic concerns the uniqueness requirement. Since the
target relations defined by the smart mapping rules with uniqueness
requirements are not involved in $\Sigma_T$, their generated instances
will trivially satisfy $\Sigma_T$.

Once all possible tuples of $T$ have been generated, the next step
consists in verifying whether the tuples in $T$ satisfy all target
constraints. This check is obviously performed in polynomial time.

COROLLARY 1. Given a Semi-deterministic Smart Data Posting
setting $(S, T, \Sigma, \emptyset)$ and a finite source instance $I_S$ for
$S$, a finite instance $I_T$ of $T$ such that $(I_S, I_T)$ satisfies $\Sigma$
always exists and can be found in polynomial time (under data
complexity).


Proof. Straightforward from Theorem 2. 


5 CONCLUSION AND FUTURE WORK 


In this paper we presented a simplified version of the Data Posting
framework, based on the use of smart mapping rules. We showed
that the data posting problem in the new setting is NP-complete and
identified the conditions under which this problem becomes polynomial
even in the presence of non-deterministic choices.

The proposed approach has been tested in a real scenario within
the MISE Project Data Alliance (D-ALL). In more detail, we
implemented a prototype that leverages the proposed framework in order
to suggest to users a set of interesting analysis dimensions. Users can
validate the proposed dimensions, in which case they are added to the
system knowledge base. Our early experiments are quite encouraging
and will be refined in depth in future work.

The simplification of the data posting framework proposed in this
paper is based on the idea of integrating local selection criteria in
the mapping process. The mapping rules, or a similar formalism, can
also be profitably used in different logic-based settings (e.g., P2P
Deductive Databases [6, 7], prioritized reasoning in logic
programming [5, 21], efficient evaluation of logic programs [14, 18],
etc.).
ABSTRACT


Predicting the number and the type of operations carried out by civil
protection services is essential, both to size on-call firefighter
teams in number and competence and to pre-position material and human
resources. Accomplishing this task requires skills in artificial
intelligence, which are not usually found in a medium-sized fire
department. However, the task may be contracted out, for example
to specialized companies or research laboratories. Such a
mandate requires the transmission of potentially sensitive
information relating to interventions, which is not intended to be
publicly available. The purpose of this article is to show that a
machine learning tool can be deployed and provide accurate results,
using a learning process based on anonymized data. Learning on real but
anonymized data will be performed using extreme gradient boosting,
and the performance of each anonymization will be compared
on the number of interventions per day and on their type.


CCS CONCEPTS 


• Information systems → Data stream mining; • Security and
privacy → Usability in security and privacy; • Computing
methodologies → Spatial and physical reasoning; Neural networks.
Data Privacy, Data anonymity
1 INTRODUCTION


For various economic and societal reasons, such as the aging of
the population, the closure of small rural hospitals, or the
disengagement of the private sector (ambulance drivers) from services
that are not profitable, French fire brigades are facing a
constant increase in the number of interventions. However, due to
the economic crisis and the state debt, the resources allocated to
public services in general, and to the fire brigade in particular, are
not increasing. The fire brigades must therefore find original
solutions to meet growing demand with constant resources. One
solution for the future is to optimize the use of their human and
material resources, by pre-positioning vehicles and adapting the
size of the guards according to the number, type and location of
interventions that an artificial intelligence algorithm could predict.

This solution requires, on the one hand, a database of past
interventions that is sufficiently rich and consistent, and on the
other hand, know-how in a constantly evolving scientific discipline.
The knowledge base is naturally present within the departmental fire
and rescue service (SDIS), which collects, for legal and statistical
purposes, many data related to each of its interventions. This
database contains information on the dates, places and types of
interventions, as well as on the responders and victims. However,
while the SDIS has this knowledge base, which is useful in the learning
phase of an artificial intelligence algorithm, it has neither the
know-how nor the human resources to implement such an algorithm.

Indeed, such an implementation implies the collection of explanatory
variables by scripts that automatically retrieve information from the
internet on past weather, ephemerides, epidemiological data, etc.
Selecting models from among the various machine learning methods
based on decision trees or artificial neurons, as well as feature
selection to reduce model complexity, requires time and up-to-date
knowledge of machine and deep learning techniques. Similarly,
finding good values for algorithm hyperparameters, or proposing
resource optimizations based on the predictions made, requires the
work of computer science researchers specializing in artificial
intelligence, high performance computing, and optimization.

While the knowledge base, with the personal data it contains, is
legally protected as long as it remains within the SDIS, its complete
transmission to another institution, even a public one, is
problematic, at least legally. Therefore, the data must be
de-identified and then processed by academics, with no intention of
public disclosure. However, while anonymization of the data is
mandatory to allow such transmission from the SDIS to the university,
this anonymization should not make the data unusable for any type of
prediction. In other words, a fair compromise should be found between
the protection of private information contained in the database and
the amount of preserved information useful for machine learning
algorithms. In fact, the question of whether such a compromise
exists and can be found is worth asking.

The objective of this article is to present a concrete case of fine
optimization by state-of-the-art techniques, making it possible to
guarantee a sufficiently high level of privacy given the context
(private exchange between fire brigades and academics), while allowing
better predictions than what could be obtained with traditional
statistical tools. It is therefore a proof of concept on a concrete
case study from the SDIS 25 (the fire service of the Doubs department
in France), showing that a fair compromise is possible, allowing a
future optimization of firefighters' resources without paying for it
with potential privacy leaks.

The rest of this article is structured as follows. The case study
is presented in the next section, which contains a description of
the data under consideration. Section 3 focuses on the problem of
de-identification, with an overview of the most important methods that
have been applied to this case study. The anonymized database is then
used to learn and predict firefighter interventions in Section 4. This
article ends with a conclusion section, in which the contribution is
summarized and intended future work is outlined.


2 DATA PRESENTATION 


The data available for the forecasts are classified by year
between 2012 and 2017. Each intervention of the firefighters of the
fire brigade of the Doubs department (a French county of 500,000
inhabitants) is recorded as one line in a file. The attributes
of this file are shown in Table 1 and described as follows:


ID | Station       | Reason  | Commune | SDate
0  | Belfort South | Malaise | Belfort | 2018/01/31 08:35

Age | Gender | SAD   | Type  | Destination
45  | Male   | NoCRA | Other | Belfort Hospital

Doctor | Condition     | Location
No     | Severe Injury | (47.616, 6.857)

Table 1: Attributes of fire brigade operations data


• ID is the intervention identifier, which is used in supplementary 
files; 
• Station is the fire station name; 
• Reason is the initial reason for the firefighters' intervention; 
• Commune is the name of the municipality where the operation 
took place; 
• SDate is the starting date of the intervention; 
• Age, Gender and Type are the age and the gender of the victim, 
and whether the victim is a fireman or not; 
• SAD indicates whether a Semi-Automatic Defibrillator has 
been used; 
• Destination gives the subsequent destination, i.e., the place 
where the firefighters transported the victim; 
• Doctor specifies whether a doctor was present at the victim's 
location; 
• Condition states the victim's condition at the end of the 
operation; 
• Location gives the precise location (latitude, longitude) of 
the intervention. 
Table 2 gives the number of interventions by firefighters per 
year. As can be seen in this table and as stated in the introduction, 
the number of firefighters' operations increases every year. 


Year | Number of operations 
2012 22,960 
2013 24,562 
2014 26,026 
2015 27,750 
2016 28,880 
2017 31,715 
Total 161,813 


Table 2: Number of interventions by firefighters per year 


3 DE-IDENTIFICATION PROBLEMS 


This section shows how the fire brigade data were de-identified in 
order to predict first the number of interventions (Sec. 3.1) and then 
the kind of intervention (Sec. 3.2). 


3.1 Number of interventions per fire station by 
time slot 


The objective of this first part is to ensure that the firefighters in each 
center are always adequate for the interventions to be carried out by 
the center's personnel. To know the number of firefighters who must be 
present and/or available in each center, it is necessary to have an idea of the 
number of interventions per year, per month, per week, per day, and per 
3-hour block in each of these centers. The objective is to publish 
data in the form of tuples (SDate, Station, #Operations), where SDate 
is the time interval (with variable amplitude as discussed above), 
Station is the station name or a generalization of it, and #Operations 
is the number of operations performed by the firefighters of 
the Station unit(s) during the time interval SDate. 
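
As an illustration of how such tuples could be built, the following minimal sketch counts operations per time slot and station. The sample records, the station names and the helper names `generalize_date` and `count_operations` are hypothetical and are not taken from the SDIS 25 files.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample records (start date, station); values are illustrative only.
records = [
    ("2012/01/08 08:35", "Station A"),
    ("2012/01/08 14:10", "Station A"),
    ("2012/01/08 23:50", "Station B"),
    ("2012/01/09 02:05", "Station A"),
]

def generalize_date(sdate: str, level: str) -> str:
    """Map a timestamp to a coarser time slot (3-hour block, day or month)."""
    t = datetime.strptime(sdate, "%Y/%m/%d %H:%M")
    if level == "3-hours":
        return f"{t:%Y-%m-%d} {3 * (t.hour // 3):02d}h"
    if level == "day":
        return f"{t:%Y-%m-%d}"
    if level == "month":
        return f"{t:%Y-%m}"
    raise ValueError(f"unknown level: {level}")

def count_operations(rows, level):
    """Return {(SDate, Station): #Operations} for the chosen time granularity."""
    return Counter((generalize_date(d, level), s) for d, s in rows)

tuples_by_day = count_operations(records, "day")
```

Generalizing the station to an urban community would work the same way, with a lookup table from station name to community applied before counting.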

In small rural centers, where the number of interventions is 
naturally low, the time slot of the study may 
be too short compared to the number of interventions carried out by 
the local fire brigade, so that the counts are not significant enough 
to be generalized. It is therefore natural to think of grouping these 
stations together at the level of the urban community to obtain 
events that are sufficiently representative in number. 

The first question here is: is the number of interventions a 
sensitive attribute? Clearly yes, because it reveals that something 
serious happened: the Fire Brigade would not have moved if the situation 
had not been critical. For example, if it is known, on the one hand, 
that a person was sick in a small village equipped with a centre 
and, on the other hand, that this centre performed an intervention during this period 
(whereas it almost never does), then there is a high probability that 
the fire brigade intervened for this person and therefore that the 
illness worsened. 

Two anonymization approaches are used here as direct 
applications of existing methods. The first is k-anonymity [8] and the 
second is differential privacy [2]. 


3.1.1 A k-anonymous de-identified dataset. In order for the article 
to be self-contained, we recall here the definition of the k-anonymity 
requirement. 

Definition 3.1 (k-anonymity requirement [8]). Each release of data 
must be such that every combination of values of quasi-identifiers 
can be indistinctly matched to at least k individuals. 


In other words, for a dataset in which every record has at least k 
equivalent records, the probability of re-identifying an individual, for any 
given attack A, is at most 1/k. 

Thus, only triplets (SDate, Station, #Operations) such that 
#Operations is greater than or equal to k will be provided for 
further analysis. The others (i.e., those with #Operations < k) are 
removed from the dataset and cannot be used in the subsequent 
prediction step, which reduces the accuracy of the approach. This raises the 
question of choosing a value for the k parameter: a high value 
decreases the overall probability of re-identification but results in 
a loss of accuracy in the data. The chosen value of k has to ensure an 
acceptable risk of re-identification for any kind of attack A, i.e., 
P(re-identification|A) must be lower than a given value. 
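
The suppression step described above can be sketched as follows. This is a minimal illustration, assuming the counts are held in a dictionary keyed by (SDate, Station); the function name `k_anonymize` and the sample counts are hypothetical.

```python
def k_anonymize(counts: dict, k: int):
    """Keep only tuples with at least k operations; report the suppressed share of records."""
    kept = {key: n for key, n in counts.items() if n >= k}
    total = sum(counts.values())
    suppressed = total - sum(kept.values())
    rate = suppressed / total if total else 0.0
    return kept, rate

# Hypothetical counts per (SDate, Station) tuple.
counts = {("2012-01-08", "A"): 30, ("2012-01-08", "B"): 4, ("2012-01-09", "A"): 16}
kept, rate = k_anonymize(counts, k=11)
```

The returned rate is the share of operations (records) that disappear, which is the quantity that matters for the later prediction step.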


The probability 

$P(\text{re-identification} \mid A) = \frac{P(\text{re-identification}, A)}{P(A)}$  (1) 

has to be evaluated for each kind of attack A, namely deliberate 
attempt at re-identification, acquaintance (i.e., inadvertent attempt), 
or breach. For each kind of attack A, the joint probability 
P(re-identification, A) must be lower than a commonly accepted 
threshold T. Following [5], since the dataset will be distributed to 
researchers only, the average risk threshold T is set to 0.1. 

In our context, researchers belong to an academic institution 
bound by a data confidentiality agreement and have no particular 
intent to re-identify records. It is recognized in such a case that 
P(Deliberate Attempt) < 0.4. The third attack (breach) can take 
place if the university loses the dataset. According to [4], 
P(Breach) = 0.27. We are then left to evaluate P(Acquaintance). 

The whole dataset is composed of fewer than 162,000 operations 
in the Doubs department (500,000 inhabitants), some of which 
may concern the same individual. The probability that an individual 
is not in this dataset is about 1 - 162000/500000 = 0.676. Since the 
average estimated number of well-known contacts is 150, the 
probability that none of them is in the dataset is approximately 
0.676^150, which is very close to 0. In this context, the probability 
of acquaintance is thus taken equal to 1, i.e., P(Acquaintance) = 1. 

By Equation (1), the higher the value of P(A), the smaller 
P(re-identification|A) must be and the more de-identification is 
required on the data set. Acquaintance is the most likely attack, so one 
has to ensure that P(re-identification|A) < T/P(A) = 0.1, i.e., k = 11. 
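
The arithmetic behind the choice k = 11 can be checked with a short sketch. It assumes that P(re-identification|A) is bounded by 1/k, that the bound must hold strictly, and that the acquaintance attack (the most likely one) drives the choice; variable names are illustrative.

```python
import math

population = 500_000      # inhabitants of the department (from the text)
operations = 162_000      # upper bound on recorded operations (from the text)
contacts = 150            # estimated number of well-known contacts (from the text)
threshold = 0.1           # average risk threshold T

p_absent = 1 - operations / population          # one person is not in the dataset
p_acquaintance = 1 - p_absent ** contacts       # at least one contact is in it

# Assumption: P(re-id | A) <= 1/k must stay strictly below T / P(A).
bound = threshold / p_acquaintance
k = math.floor(1 / bound) + 1                   # smallest integer k with 1/k < bound
```

With these figures `p_absent ** contacts` is of the order of 1e-26, so `p_acquaintance` is numerically 1 and the bound reduces to 1/k < 0.1.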


Attribute | Generalization hierarchy 
SDate | date-hh:mm:ss → 3-hours → day → week → month 
Station | station name → urban community → county 


Table 3: Generalization hierarchy for number of interventions 
per fire station by time slot 


A generalization approach can be applied to both attributes, 
SDate and Station, as shown in Table 3. It is a list of simplifications 
that can be applied to attribute values, ordered from the 
most specific level to the most general one. Counting the number of 
operations in an urban community rather than in a fire station aims 
to reduce the number of deletions in the data set and thus to allow better 
learning and prediction: the number of operations per time 
interval falls below k less often for an urban community than for a 
single fire station. Of course, the predictions 
are then given at the level of these communities, but many firefighters 
live in the metropolitan communities and can move to another 
fire station if needed. Table 4 gives the results of 11-anonymity 
with respect to the generalization parameters. In each cell, the first 
number gives the rate of suppressed records whereas the second is 
the entropy value [5] expressed as a percentage (of the 
maximum possible entropy for the data set). It can be deduced that 
generalizing the starting date to the day and the fire station 
to the urban community gives acceptable results in terms of both 
record loss and entropy. 


3.1.2 An ε-differentially private dataset. Differential privacy [2, 3] is 
a property of an anonymization technique that minimizes the privacy 
impact on individuals whose information is in the database. From 
a probabilistic point of view, the released result is almost as likely 
whether or not an individual's information is in the dataset, so an 
attacker cannot infer sensitive data about that individual from it. 
In practice, it may be implemented by adding noise to query results. 

Let f be the function that associates to each fire station its 
number of interventions at a given time. If an operation by firemen 
of this station is deleted, the impact is exactly 1 and the sensitivity 
of f, usually denoted as Δf, is thus equal to 1. It has been proven 
that a mechanism that returns f(x) + y, where y is added noise 
that follows a Laplace distribution Lap(0, Δf/ε), is ε-differentially private. 
A high value of ε leads to small noise and thus to a weak 
guarantee of privacy. Conversely, a small value provides a strong 
probabilistic guarantee against attacks. We are then left to assign a 
value to the ε parameter with the goal of hiding any individual's 
presence in the dataset. 
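
A minimal sketch of such a noise-adding mechanism is given below, assuming a count query with sensitivity 1 and noise of scale sensitivity/ε. The function names are hypothetical; the Laplace sample is drawn as the difference of two exponential variables, which is one standard way to obtain it with the Python standard library.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw from Laplace(0, scale) as the difference of two exponential variables."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(true_count: int, epsilon: float, rng: random.Random,
                  sensitivity: float = 1.0) -> float:
    """Return f(x) + y with y ~ Laplace(0, sensitivity / epsilon)."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
samples = [private_count(28, epsilon=1.0, rng=rng) for _ in range(20_000)]
mean_noisy = sum(samples) / len(samples)
```

Because the noise is centred on zero, averages over many noisy counts stay close to the true value, while any single released count hides the contribution of one intervention.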

According to [6], the value of ε should be bounded by 

$\varepsilon \le \frac{\Delta f}{\Delta v} \ln \frac{(n-1)\,p}{1-p}$  (2) 

where n is the number of lines of our dataset (i.e., at least n = 
22,960), Δv is the largest distance between two datasets from which 
a line has been removed each time (i.e., Δv = 2), and p is the 
probability of being identified as present in the database. To be 
consistent with Sec. 3.1.1, p is set to 0.1. In such a case, ε should 
be lower than 3.92. The value ε = 1 has been retained here. 
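
The bound can be evaluated numerically. The sketch below assumes that Equation (2) has the form ε ≤ (Δf/Δv) ln((n−1)p/(1−p)), a reading that reproduces the value 3.92 quoted above; the function name is hypothetical.

```python
import math

def epsilon_upper_bound(n: int, delta_f: float, delta_v: float, p: float) -> float:
    """Evaluate (delta_f / delta_v) * ln((n - 1) * p / (1 - p))."""
    return (delta_f / delta_v) * math.log((n - 1) * p / (1 - p))

bound = epsilon_upper_bound(n=22_960, delta_f=1, delta_v=2, p=0.1)
```

Since n enters only through a logarithm, using the size of the smallest yearly file (22,960 lines) rather than the full dataset changes the bound only modestly.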


3.2 Nature of firefighters’ interventions by 
time slot 


To optimize the material and human resources present in each 
station or urban community, it would be interesting to predict the types 
of interventions by time interval (year/month/week/day/3-hour 
block) in each area of interest (station, urban community, 
department). 

For a particular time block of a given amplitude, the types of 
tasks executed (the Reason attribute) are extracted from the data 
set. The cardinality of this set (from which duplicate types are removed) 
is naturally lower than the number of interventions counted in 
Section 3.1. The nature of these interventions is clearly 
sensitive data. 
Station \ SDate | date-hh:mm:ss | 3-hours | Day | Week | Month | Year 
Fire station | 99.8/0.0 | 99.6/23.1 | 64.3/70.6 | 27.6/60.9 | 5.2/75.4 | 0.2/99.8 
Urban community | 99.8/0.0 | 96.9/23.1 | 32.7/32.8 | 7.8/61 | 0.7/75.5 | 0.0/99.9 
County | 99.8/0.19 | 38.0/38.9 | 0/42.1 | 0/61.1 | 0/75.6 | 0/100 


Table 4: Number of interventions: anonymization by generalization and 11-anonymity 


There are ~400 different reasons in the database for firefighters 
to be involved, some of which overlap or are very similar to 
each other. The reason for departure 
is recorded at the beginning of each intervention, i.e., often in an 
emergency context, which leads to a certain number of errors: the finer 
the granularity, the more errors there are. 
To improve data quality, the reasons for firefighters to leave are 
therefore regrouped into 7 classes: personal assistance, 
road rescue, another accident, fire or explosion, various operations, 
preventive operations, and other reasons. This is like applying a 
low-pass filter. Once again, it is a question of finding the right 
compromise between the usefulness of the data and their quality. 
In all of the following, we only consider data resulting from this 
grouping. 


3.2.1 Recursive (c, l)-diversity. Publishing the types of 
interventions is critical because, if they are not varied enough, 
this information can be misused and lead to a positive or a negative 
disclosure. For example, if all the outings that took place on a given 
date involved heart ailments and if we know that a person was 
rescued by firemen on that day, we deduce that this person had a heart 
attack. This is the problem identified by Machanavajjhala et al., who 
addressed it with l-diversity [7]. 

Intuitively, a group of records (block, equivalence class) is said to 
be l-diverse if there are at least l "well-represented" values for the 
sensitive attributes (which may be a single sensitive attribute, a 
pair of sensitive attributes, ...). The dataset is said to be l-diverse if 
each group of records is l-diverse. The notion of "well-represented" 
is intentionally ambiguous. Merely having l distinct values is not 
sufficient for this definition. A potential refinement is to require that 
the values be distributed according to a law approaching the 
uniform distribution, which gives the notion of entropy l-diversity. 
However, this constraint is often overly restrictive. 

We prefer a less restrictive refinement, which stipulates that 
the ratio between the count of the most represented value and the sum 
of the counts of the m − l + 1 least represented ones is less than a 
constant c provided by the user, where m is the number of distinct 
sensitive values in the group. This definition is known as recursive 
(c, l)-diversity [7] and is recalled here. 


Definition 3.2 (Recursive (c, l)-Diversity). In a given q*-block, let 
r_i denote the number of times the i-th most frequent sensitive value 
appears in that q*-block. Given a constant c, the q*-block satisfies 
recursive (c, l)-diversity if $r_1 < c\,(r_l + r_{l+1} + \cdots + r_m)$. A table T 
satisfies recursive (c, l)-diversity if every q*-block satisfies recursive 
(c, l)-diversity. 
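
The condition of Definition 3.2 can be checked block by block. The following sketch assumes the condition r_1 < c (r_l + ... + r_m), with r_i the count of the i-th most frequent sensitive value; the function name and the sample block are hypothetical.

```python
from collections import Counter

def is_recursive_cl_diverse(sensitive_values, c: float, l: int) -> bool:
    """Check r_1 < c * (r_l + r_{l+1} + ... + r_m) for one q*-block."""
    freqs = sorted(Counter(sensitive_values).values(), reverse=True)
    if len(freqs) < l:
        return False  # fewer than l distinct values: the tail sum is empty
    return freqs[0] < c * sum(freqs[l - 1:])

# Hypothetical block: reasons recorded for one (day, urban community) group.
block = ["personal assistance"] * 8 + ["road rescue"] * 2 + ["fire or explosion"]
```

For this block the sorted counts are 8, 2 and 1, so it is recursive (5, 2)-diverse (8 < 5 × 3) but not (2, 2)-diverse (8 ≥ 2 × 3), which illustrates why larger values of c and smaller values of l are easier to satisfy.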


The larger the constant c is, the more easily this property is 
satisfied. 

From the experiments in Section 3.1.1 concerning 11-anonymity, 
we focused on the generalization of fire stations at the level of the 
agglomeration community and of the date of intervention at the level 
of the day. For this given generalization, we varied l and c both in 
{2, 3, 4, 5, 6}. Results are summarized in Table 5 where, for each pair 
(l, c), the rate of suppressed records is given. 

In this table, and not surprisingly, the number of deleted records 
decreases with respect to c but increases with respect to l. The 
de-identification satisfying recursive (5, 2)-diversity has thus been 
retained because it is the only one that does not delete all the data. 


| 5 100 | 100 | 100 | 100 | 100 
Table 5: Reasons of interventions: rate of suppressed records 
with anonymization by generalization, 11-anonymity and 
recursive (c, l)-diversity 


3.2.2 Differentially Private Histogram of Operations. In [9], Xu et 
al. show how to publish a differentially private histogram 
that outputs the distribution of a random variable, such as the 
number of operations with respect to the attribute Reason of 
intervention. 

The approach has two steps. First, for a given time slot 
(e.g., a whole day), a histogram is constructed representing the 
number of interventions performed during this period, with 
values grouped according to the reasons for the intervention 
of the fire brigade. Second, noise is added to unit-length 
bins using the Laplace mechanism. The resulting histogram is then 
published for analysis. The clustering of reasons for departure 
into 7 classes (as presented at the beginning of this section) was 
particularly guided by this step. Indeed, without this grouping, a 
histogram with potentially 400 bars would be constructed (there are 
approximately 400 different reasons of intervention). Even with the 
addition of very low random noise, some of the reasons could then 
appear when they are not at all correlated with an event. By grouping 
the reasons into 7 classes, the granularity is certainly coarser, but 
the noisy counts remain meaningful. 
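
The two steps can be sketched as follows, in their simplest form: one bin per class and independent Laplace noise on each bin. This is only the baseline with unit-length bins, not the more elaborate histogram constructions studied in [9]; the function name, the seed and the sample day are hypothetical.

```python
import random
from collections import Counter

CLASSES = ["personal assistance", "road rescue", "another accident",
           "fire or explosion", "various operations",
           "preventive operations", "other reasons"]

def noisy_histogram(reasons, epsilon: float, rng: random.Random) -> dict:
    """Count interventions per class, then add Laplace(0, 1/epsilon) noise to every bin."""
    counts = Counter(reasons)
    scale = 1.0 / epsilon  # sensitivity 1: one intervention changes one bin by 1
    return {
        cls: counts.get(cls, 0) + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        for cls in CLASSES
    }

# Hypothetical day of interventions, already grouped into the 7 classes.
day = ["personal assistance"] * 20 + ["road rescue"] * 3 + ["fire or explosion"]
hist = noisy_histogram(day, epsilon=1.0, rng=random.Random(1))
```

Empty classes also receive noise, so every one of the 7 bins is published; with 400 bins most published values would be pure noise around zero, which is the situation the grouping avoids.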

As in Sec. 3.1.2, ε should be chosen to respect privacy concerns. 
Even if the query executed here on the database is different from 
the one given in that section, all the variable values of Equation (2) 
are the same, leading to the same bound of 3.92 for ε. In what follows, 
ε is thus again set to 1. 
4 MACHINE LEARNING PREDICTIONS 


4.1 General presentation 


The objective of this section is to evaluate whether it is 
possible to make predictions about the activity of firefighters from 
anonymized files. We focus on the number of interventions per 
unit of time, the type of intervention, and the solicitation per centre. 
In each case, predictions based on anonymized data are 
compared to those based on raw data. More specifically, we look at 
whether, from the anonymized data of 2013-2017, we can find out 
what happened in 2012, as described in the anonymized file of 2012. 
This predictive ability is compared to the score obtained by 
predicting the year 2012 (not anonymized) after learning on the 
raw data for 2013-2017. Finally, the 2012 prediction based on the 
anonymized 2013-2017 data is compared to the non-anonymized 
2012 data. Note that we have chosen to predict 2012 from 2013-2017, and 
not 2017 from 2012-2016, because in 2017 the number of 
interventions exploded due to a disengagement of the private sector 
(ambulance drivers), which artificially inflated firefighters' interventions 
for reasons that are difficult to predict because they are no 
longer linked to human activity. Instead of predicting the future, 
we thus reconstruct a potentially unstored past. 

In order to achieve this supervised learning, we had to recover a 
collection of explanatory variables that could potentially explain 
the number, type and location of interventions. We have assumed 
that these interventions are directly related to human activity (for 
example, there are fewer interventions at night, because people sleep), 
which itself changes according to the time of year (holidays, 
seasons, ...), the weather, etc. These explanatory variables, for each hour 
of the period under consideration, are publicly available on the 
Internet, and have enabled us to recover with some precision the 
2012 interventions from those of 2013-2017. 

In detail, the following numerical variables were recovered from 
the MétéoFrance site for the three weather stations closest to the 
Doubs (Nancy, Dijon and Basel): wind direction, humidity, dew 
point, precipitation during the previous hour and during the last 3 
hours, pressure and its variation over the last 3 hours, 
temperature, wind speed, and finally visibility. At the calendar level, 
we have added the year, the month, and the day in the week (Monday, ...), in 
the month (1, 2, ..., 31) and in the year, in order to identify days 
that differ from normal ones (national holiday, Christmas, ...). 
Epidemiological data have also been added on the incidence of influenza, 
chickenpox and diarrhoea over the past week, collected from the 
Sentinel network. Finally, since the Doubs department is rich in 
mountains, forests and rivers, in a temperate region, we occasionally 
have heavy rainfall leading to sudden variations in the height 
of the rivers. The latter lead to floods, which require assistance to people. 
The heights of six rivers have therefore been added, with their variations 
over the past hour. 

In the following, we present the prediction results 
obtained using the explanatory variables 
on the original data, and then those obtained using the anonymized 
version of the data. It should be noted that the machine 
learning algorithm used here is extreme gradient boosting 
(XGBoost), with the default hyperparameter values [1]. For each set 
of prediction attempts, 5 experiments have been done with distinct 
seeds for random initialization. Each curve represents the 
mean over these experiments, and the standard deviation is always 
displayed with vertical bars. 


4.2 Predicting the number of interventions 


In this section, the objective is to predict the number of 
interventions per fire station by time slot. The de-identification has shown 
that an acceptable trade-off between the number of suppressed records, 
the entropy and the duration of the time slot is obtained by merging 
data inside Urban Communities and over a duration of one day 
(see Table 4). 

Let us first recall that ensuring 11-anonymity had a cost: a 
number of lines have been deleted. More precisely, 32.7% of the 
interventions were removed, and the number 
of predicted interventions is therefore multiplied by a factor 
1/(1-32.7/100) = 1.486. For this method, this corresponds to a 
systematic increase in forecasts of about 49%. 
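
The correction factor follows directly from the suppression rate, as the short sketch below shows; the function name is hypothetical and the rate 0.327 is the one quoted above.

```python
def rescale_forecast(predicted: float, suppressed_rate: float) -> float:
    """Compensate a forecast trained on data from which a share of records was removed."""
    return predicted / (1 - suppressed_rate)

factor = rescale_forecast(1.0, 0.327)
```

Note that this is a global correction: it is applied uniformly, even though the share of suppressed records differs from one urban community to another.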


Figure 1: Predictions Week # 2, 2012. Panels: (a) Grand Besancon, (b) Grand Pontarlier. Curves: Actual, DP1, DP10, K11, Raw data. 


Three anonymisation methods have been applied, namely 
11-anonymity and Differential Privacy with ε = 1 and ε = 10. Figure 1 
and all subsequent ones compare the results of predictions from the 
original data, from data anonymized using these three methods, and the 
mean (which is the simplest prediction). Each time, these 
predictions are given for the week between January 8 and 14, chosen as 
an arbitrary example. All these results focus on two agglomeration 
communities, namely Grand Besancon and Grand Pontarlier. These 
two agglomerations were chosen because their demographic 
characteristics differ markedly. The former has about 200,000 
inhabitants living in 68 municipalities, mainly urban, over a surface 
area of 528.6 km2, i.e., approximately 379 inhabitants/km2. 
The latter is composed of approximately 27,000 inhabitants who 
live in 10 municipalities, mainly rural, over an area of 154 km2, that 
is, approximately 175 inhabitants/km2. 

In this figure and in all the following ones, Actual (in purple) always 
denotes what really happened during this week. Concerning 
predictions, K11 (in green) is the average curve of predictions obtained 
with 11-anonymity. DP1 and DP10 are the mean curves after 
Differential Privacy based anonymization with ε = 1 and ε = 10 
respectively. Red curves represent forecasts from the original raw 
data (i.e., non de-identified data). 

Let us explain the results obtained for the agglomeration of Grand 
Besancon (Figure 1(a)). First of all, the average number of 
interventions for this agglomeration is 28.1 with a standard deviation of 7.7. 
The Mean Absolute Error (MAE) when considering this average as 
a prediction of reality is 6.6. Any useful predictor 
must reduce this error. 
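
For reference, the MAE of the constant "mean" predictor can be computed as in the sketch below. The daily counts are hypothetical and are not the values behind Figure 1.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over paired observations."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical daily counts for one week (not taken from the paper).
actual = [31, 25, 22, 19, 27, 35, 38]
baseline = [sum(actual) / len(actual)] * len(actual)   # constant "mean" predictor
baseline_mae = mean_absolute_error(actual, baseline)
```

A learned model is only worth keeping if its MAE on the same days is lower than `baseline_mae`.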

Note that all predictions based on Differential Privacy and on 
raw data are consistent: the standard deviation for each prediction 
is about 6.6. Here, the prediction with data anonymized with 
11-anonymity is over-estimated. As already stated, this method 
has led to the suppression of 32.7% of the data. However, only a few of 
the removed records concern Grand Besancon, so the forecast for 
this urban community was already accurate enough. Since all the 
predictions with this generalization-based anonymization method 
have nevertheless been increased by 49%, this leads here to an 
over-estimation. In this agglomeration, even if the predictions are not 
extremely accurate, we can see that they follow the same trend as 
reality: a relative decrease until the middle of the week, followed by 
a more or less rapid increase. The mean absolute errors w.r.t. the chosen 
anonymisation method are reported in Table 6. It can be seen 
there that the predictions on data anonymized by Differential 
Privacy have the same level of accuracy as those from the original 
data. 

The results are much less homogeneous for the Pontarlier urban 
community (Figure 1(b)). In this case, the predictions from the data 
anonymized by 11-anonymity are far below reality and the other 
predictions. This is explained by the fact that in this urban community, 
the average number of interventions for this week is 4.7 with a 
standard deviation of 2.6. Many records concerning this agglomeration 
community are thus deleted by the 11-anonymity method. The 
average number of interventions predicted with this anonymization 
method is indeed 1.7, even after the adjustment of the data by a 
factor of about 1.5, which is far below reality. The other approaches, 
based on Differential Privacy, give forecasts of the right order of 
magnitude. Regarding the MAE of predictions (Table 6), and as in the 
other agglomeration community, predictions are as accurate when 
using data anonymized by Differential Privacy as when using raw data. 


Jean-Francois Couchot, Christophe Guyeux and Guillaume Royer 


Method | Grand Besancon | Grand Pontarlier 
Average number of interventions | 28.1 | 4.7 
Mean | 6.6 | 2.0 
11-anonymity | 16.0 | 3.6 
DP (ε = 1) | 5.6 | 1.9 
DP (ε = 10) | 5.5 | 1.9 
Raw data | 5.7 | 1.9 


Table 6: Mean Absolute Error with respect to anonymization 
method 


Another positive point is that we find, in general, the same 
relative importance of each explanatory variable: the same causes 
explaining the number of interventions are highlighted (causality 
is not confused). The five most important features as provided by 
the plot importance function (namely, the year, wind direction, 
day in the year, humidity, and water level of the Doubs River) are 
the same, though not in the same order. To put these results in 
perspective, note that, on anonymized data, we obtain predictions that 
are not totally meaningless (compared to the average), even though: 


• no model selection (choice of the machine learning 
algorithm) has been performed; 
• no preliminary step was taken to select explanatory 
variables; 
• no attempt was made to optimize the many hyperparameters 
of XGBoost. 


4.3 Predicting the nature of interventions 


In this set of experiments, two anonymisation methods have been 
applied. The former is 11-anonymity combined with recursive 
(5, 2)-diversity and the latter is the histogram of operations compliant 
with Differential Privacy (with ε = 1 and ε = 10). This article focuses only 
on two types of intervention, namely personal assistance and road 
rescue. Personal assistance is indeed very frequent and can usually 
be managed by several services: the SAMU, private ambulances 
and fire brigades. In contrast, road accidents are less frequent 
(and probably predictable with less accuracy), but are systematically 
handled by firefighters. Results of predictions are shown in Figures 2 
and 3. The former deals with personal assistance whereas the latter 
focuses on road accidents. As in the previous section and for the 
same reasons, these figures focus on two agglomeration communities, 
namely Grand Besancon and Grand Pontarlier, and the color codes 
are the same as in the previous section. 

Let us first focus on personal assistance. As in the previous 
section, the number of interventions carried out for this reason of 
departure is overestimated when anonymization is achieved by 
11-anonymity and recursive (5,2)-diversity for the urban community 
of Grand Besancon, and underestimated for Grand Pontarlier. This 
happens because records containing this reason may be deleted by 
this method. 

For a medium-sized urban community such as Besancon, the 
trend is also observed on anonymized data, even if it is slightly 
overestimated. This is explained by the fact that the number of 
interventions for this reason has increased steadily 
over the years (between 2012 and 2017) and that 2012 is therefore 
the year in which these outings were the least numerous. For 
the small urban community of Pontarlier, the trend is also found, 
even on data anonymized by the method combining histograms 
and differential privacy. For ε equal to 1 (which guarantees 
acceptable privacy), predictions close to the mean are found. 

Figure 2: Personal assistance, Predictions Week # 2, 2012 

Road accidents are quite uncommon and therefore more difficult 
to predict, especially in small rural communities. In this context, it 
seems even less relevant to apply the anonymization method based on 
11-anonymity and recursive (5,2)-diversity to make subsequent 
predictions. This is confirmed by the curves of Figures 3(a) and 
3(b). The standard deviations for this anonymization method are 
indeed very large, and the forecasts are very far from reality. 

The noise added by the method combining histograms and 
differential privacy is sufficiently limited even when ε = 1. Indeed, 
the general trend is found with almost as much precision on data 
anonymized by such an approach as on raw data, i.e., on 
non-anonymized data. This trend is numerically validated by the values 
given in Table 7, which summarizes the mean absolute errors by 
agglomeration community, by type of intervention and according 
to the chosen anonymization method. As in the previous table on 
prediction errors concerning the number of interventions, it can 
be seen here that the histogram method with differential privacy 
yields predictions as precise as those obtained on raw 
data. 

Figure 3: Road intervention, Predictions Week # 2, 2012 


Method | Grand Besancon: personal assistance | Grand Besancon: road accident | Grand Pontarlier: personal assistance | Grand Pontarlier: road accident 
Average number of interventions | 24.1 | 3.4 | 4.0 | 0.6 
Mean | 5.6 | 2.0 | 1.7 | 0.8 
11-anonymity and recursive (5,2)-diversity | 7.6 | 3.9 | 3.3 | 0.9 
DP (ε = 1) | 4.5 | 2.0 | 1.7 | 0.8 
DP (ε = 10) | 4.6 | 2.0 | 1.7 | 0.9 
Raw data | 4.5 | 2.0 | 1.6 | 0.8 


Table 7: Mean Absolute Error with respect to anonymization 
method 
5 CONCLUSION 


"Can we predict, and with what accuracy, the number (1) and 
nature (2) of firefighters' interventions in a geographical area while 
respecting the privacy of the victims they rescued?" This article 
gives a positive answer. In both the quantitative (question (1)) and 
qualitative (question (2)) domains, it shows that approaches based on 
differential privacy provide more accurate results than 
generalization and suppression ones. It is possible to use 
privacy-respecting (i.e., properly anonymized) data to obtain accurate 
predictions. 

It should be noted that the variable ε was deliberately set to 1 
to ensure a high level of privacy. By increasing this value (up to 
the calculated threshold 3.92), the obtained results would have been 
even more accurate. 

The prospects for this work are numerous. We will first study the 
possibility of predicting the places of intervention, knowing that 
this attribute is very critical because it almost allows the victim to 
be identified. 


This study has been supported by the EIPHI Graduate School (con- 
tract "ANR-17-EURE-0002"), by the Interreg RESponSE project, and by 
the SDIS25 firemen brigade. 
ABSTRACT 


As our society has become more information oriented, each 
individual is expressed, defined, and impacted by information and 
information technology. While valuable, current state-of-the-art 
solutions are mostly designed to meet enterprise/organizational 
privacy requirements and leave the main actor, i.e., the user, 
uninvolved or with limited ability to control his/her 
information sharing practices. To overcome these limitations, 
algorithms and tools that provide a user-centric privacy 
management system to individuals with different privacy concerns 
are required; they must take into consideration the dynamic nature of 
privacy policies, which constantly change based on the 
information sharing context and environmental variables. This paper 
extends the concept of contextual integrity to provide mathematical 
models and algorithms that enable the creation and management 
of privacy norms for individual users. The extension includes the 
augmentation of environmental variables, such as time and date, as part 
of the privacy norms, while introducing an abstraction and a partial 
relation over information attributes. Further, a formal verification 
technique is proposed to ensure privacy norms are enforced for 
each information sharing action. 


CCS CONCEPTS 


• Security and privacy → Logic and verification; • Computer 
systems organization → Embedded systems; Redundancy; Robotics; 
• Networks → Network reliability. 
1 INTRODUCTION 


A Privacy Bill of Rights was endorsed by the White House in 2012, a 
response to an increasingly loud objection of citizens on the lack of 
privacy and fair information practices guidelines [20]. The predica- 
ment was not only recognized by the US government, but has also 
been investigated and studied at the international stage and has re- 
sulted in reports such as "Rethinking personal data: Strengthening 
trust" by the World Economic Forum (WEF) [40] and "Recommen- 
dations for businesses and policymakers" by the Federal Trade 
Commission (FTC) [12]. Despite all these efforts, ubiquitous online 
monitoring of users' activities [29] and scandalous data breaches, 
e.g., Facebook and Cambridge Analytica, continue to haunt Online 
Social Network (OSN) users [2, 11]. These privacy breaches are 
often due to a lack of regulatory standardization. Hence, the onus is 
on the user to take control of: what types of information should be 
shared with whom and when. However, controlling and managing 
the information sharing parameters could be a cumbersome and 
difficult process [15, 21, 44]. Therefore, ample tools and algorithms 
should be developed and provided to users so they are able to define 
and enforce their own customized, unambiguous privacy policies 
and have control over how their information is shared. State-of-the-art 
research on privacy management mostly consists of access 
control languages [4, 33, 39], different privacy settings in 
applications, and formal privacy policies [5, 10, 14, 22, 36]. While valuable, 
these works are mostly based on enterprise/organizational 
privacy management and leave the main actor, i.e., the user, 
uninvolved or with limited ability to control the information sharing 
parameters. In addition, existing privacy regulations like HIPAA 
or a corporation's privacy policies are domain-specific and static, 
with little or no change over time. On the other hand, the user's 
privacy policies are dynamic and change based on many factors, 
e.g., context, environment, and relationship status. In addition to 
dynamicity, the privacy framework should provide the user with 
the ability to adapt the policies to their own personal needs, since 
the definition of privacy varies from person to person based on 
their personality, cultural background, etc. [30]. 

In order to move towards a more practical solution, this paper 
proposes a framework to build a user-centric privacy management 
system. We focus on developing the main core of this framework, 
which is the privacy formalization and verification engine that allows 
for the guided and flexible specification of users' privacy intentions. 
The formalization and verification engine performs formal 
reasoning about the user's privacy rules to detect privacy violations. 
Further, the proposed approach ensures that the defined privacy 
policy is unambiguous, and a consistency check is 
proposed so that all existing and newly defined policies are 
consistent with one another. The underlying formalization utilizes 
two formal models: 1- the user’s information sharing model, and 
2- the privacy-preserving model. The user’s information sharing 
model represents all the user’s information sharing activities to oth- 
ers. The privacy verification is performed by mapping each user’s 
information sharing parameters (known as a state) to a state in 
the privacy-preserving model; a state with no mapping indicates a 
privacy violation. As a proof of concept, the privacy formalization 
and verification engine is implemented as a Java program that de- 
tects privacy violations as the user shares information in real-time. 
Since this framework is targeted for smart devices, which usually 
have low memory and low processing power, its performance was 
evaluated on both a PC and a Raspberry Pi model B to show the 
practicality of our approach. 

The future work will extend the current effort to include: user 
privacy requirement elicitation, identification and categorization of 
the information shared by users, and detection of the relationship 
changes between a user and recipients. 

The rest of the paper is organized as follows: Section 2 reviews 
related work. Section 3 gives a detailed description of our 
formalism and verification engine, and the implementation details of 
our framework are given in Section 4. The performance 
evaluation of the proposed framework is given in Section 5. Finally, 
Section 6 concludes the paper and discusses future 
work. 


2 RELATED WORKS 


For over 120 years, researchers have studied privacy in different 
settings of technological advances [41, 45]. The first privacy theory 
emerged when newspapers started to publish personally intrusive 
articles and photographs [41]. This led to the seclusion and 
non-intrusion theory of privacy, which defined the user's privacy as "the 
right to be left alone" [45] or being free from intrusion [18]. As new 
technologies were introduced, such as databases containing the 
personal information of users [41], information-related privacy 
concerns [38] emerged. To address these concerns, researchers 
developed the control [46], limitation [16], and Restricted Access/Limited 
Control (RACL) [32] theories to enable users to control and limit 
their privacy while sharing information with others. In RACL 
theory, the user's privacy is defined as "a situation with regard to 
others [if] in that situation the individual...is protected from 
intrusion, interference, and information access by others" [42]. The 
control, limitation, and RACL theories assume a rigid definition 
of privacy, while in the current technological era the meaning of 
privacy changes based on societal norms. To address this issue, 
Nissenbaum proposed the Contextual Integrity (CI) theory of 
privacy [34], where privacy behaviors are affected by the context of 
the information sharing environment. 
To implement the above theories, privacy policy languages were 
created based on the theories of limitation, control, and RACL. The 
early privacy languages were either created by augmenting 
access control languages or follow the same structure, specifying 
policies as a set of access roles and information categories 
in a structured format like Extensible Markup Language (XML). 
Some well-known examples of such languages are Platform for 
Privacy Preferences Project (P3P) [39], Enterprise Privacy 
Authorization Language (EPAL) [4], eXtensible Access Control Markup 
Language (XACML) [33], and Confab [19]. The early versions of 
these languages lacked temporal modalities; extended versions 
addressed this gap, for example by adding spatio-temporal 
attributes to XACML [27, 35, 43]. 

Another common formalism for privacy is based on transition 
systems, where the policies are specified as actions and states of 
information sharing. Privacy policies are formalized in terms of the 
privacy-preserving and privacy-violating actions in the system. 
In this formalism, the temporal characteristics of privacy are 
modeled using Linear Temporal Logic (LTL). Lu et al. [28] proposed 
a technique that translates the privacy specification of web 
services to LTL formulas. Privacy Interface Automata (PIA) 
are then presented to transform the messaging structure extracted from 
the web service business process execution language (WS-BPEL) 
into an automaton, creating their privacy policy model. Krishnan et 
al. [26] also proposed an approach to enforce privacy requirements 
using role-based access control and LTL. Their technique contains 
behavior automata that model the system behavior (gathering or 
using data) and access control automata that enforce the privacy 
policies. Kouzapas et al. [25] combined the π-calculus and a privacy 
calculus to verify privacy policies formally. Their framework has a 
type system to capture privacy-related notions and a language for 
expressing the privacy policies. Grace et al. [17] proposed a model 
of user-centric privacy with a labeled transition system, which 
compares cloud service privacy policies with the users' privacy 
preferences. However, while they provide customizable privacy 
preferences, they do not consider environmental variables in their 
model. Although this group of approaches specifies privacy with formal 
semantics and considers temporal modalities, the action-based 
modeling of the system is not scalable [5]. 

The scalability issue in action-based systems was addressed by 
Aucher et al. [5], who proposed specifying privacy policies over 
the knowledge that the information sharing action exposes to the 
recipients of the information. In this model, a privacy policy is 
specified as allowed and prohibited knowledge rather than actions, and 
different actions can result in different knowledge exchange. They 
used dynamic epistemic deontic logic (DEDL) as the foundation of 
their language. The authors define information sharing conditions 
as permitted or forbidden knowledge, but the proposed language 
does not support temporal modalities. Pardo et al. [36] also 
presented a formal language for privacy policies, using epistemic logic 
for social network models. Their formal privacy policy did 
not contain time features; later, [24, 37] extended [36] to add 
time characteristics to the privacy language through time intervals 
and LTL, which led to the creation of a timed privacy framework for 
social media. Both frameworks used a social network model and 
privacy policies as properties for model checking [7] verification. 
While a variety of implementations based on the theories of 
limitation, control, and RACL continues to grow, another group of 
studies has focused on the implementation of the CI theory of privacy. 
Barth et al. [8] utilized first-order logic and LTL to model 
the transfer of knowledge between agents during the information 
sharing activities that are governed by Nissenbaum's concept of 
norms. In this context, a positive norm is defined as a permission 
that allows an information sharing activity, and a negative norm 
prevents the information sharing activity. Further, the implementation of 
CI was extended by DeYoung et al. [14] to include the notion of 
purpose and self-reference based on their Privacy Least Fixed Point 
(LFP) framework. The proposed framework resulted in a broader 
formalization of the HIPAA and GLBA privacy laws. 

The above approaches assume that the privacy policies will be 
created in a manner that keeps them consistent with one another. 
However, privacy is dynamic in nature, and as relationships and users' 
requirements change, privacy policies must change as well. 
These changes can result in privacy policy conflicts. Therefore, 
Breaux et al. [10] proposed Eddy, which utilizes CI. The goal of their 
research was to find privacy conflicts in multi-stakeholder privacy 
policies. To achieve that goal, natural language policies 
are translated to Description Logic (DL) [6] so they can be used in the 
formal reasoning process to investigate whether the policies are 
consistent. Eddy and many other frameworks that are based on 
CI theory are designed and developed based on organizational 
privacy requirements, which are not compatible with individual 
users' privacy requirements. 

For that reason, this paper defines and formalizes a user-centric 
privacy model utilizing CI theory. The next section describes the 
details on the methodology of our framework. 


3 A FORMAL MODEL FOR USER-CENTRIC 
PRIVACY MANAGEMENT 


This research extends the concept of contextual integrity [8] to 
provide mathematical models and algorithms that enable the 
creation and management of privacy norms for individual users. The 
extension includes the augmentation of environmental variables, 
e.g., time and date, as part of the privacy norms, while introducing 
an abstraction and a partial relation over information attributes. 

The proposed framework is based on two formal models: 
1- the User's Information Sharing Model (UISM), which represents the 
information sharing activities in real-time, and 2- the Privacy-Preserving 
Model (PPM), which formally specifies the user's privacy 
requirements. The privacy verification is performed by mapping 
each action in the UISM to its corresponding action in the PPM. If 
an action cannot be mapped, a privacy violation is 
detected and reported to the user for confirmation. The rest of this 
section explains the above concepts in detail. 


3.1 User Information Sharing Model (UISM) 


UISM is designed based on the formal definition of the entities that 
construct an agent-based information communication mechanism. 
This is done to model the user's information sharing behavior with the 
recipients, which are defined as agents [5, 8]. Hence, $P$ is defined as 
a set of agents that are the recipients of the information sent from 
the user. For example, Alice and Bob are agents with whom the user shares 
information. In addition, $T$ is a set of attributes that 
defines the information shared with $p \in P$, such as "home address" 
or "credit card number". 

From the above definitions, a knowledge state $\kappa$ is defined as 
a set of tuples of the form $(p, \{t_1, \ldots, t_n\})$, which describes the 
attributes $t_i \in T$ that are shared with an agent $p$. For example, 
$(Alice, \{\text{home address}, \text{credit card number}\})$ means that Alice knows 
about the "home address" and "credit card number". If 
agents have no knowledge about the user, then $\kappa$ can be an empty 
set. The absence of tuples for $p$ indicates that the agent $p$ 
possesses no information about the user, i.e., $(p, \emptyset) \notin \kappa$. 
Thus, $\kappa$ can be defined as follows, where $P$ is a set of agents and 
$\mathcal{P}(T)$ is the power set of attributes: 


$\kappa \in \mathcal{P}(P \times (\mathcal{P}(T) \setminus \{\emptyset\}))$ 


For brevity we use $\bar{t}$ to represent an element of $\mathcal{P}(T)$, i.e., $\{t_1, \ldots, t_k\}$. 

In the proposed framework the user can perform two commands, 
to share or to stop sharing information with an agent. Each share ($sh$) 
or stop sharing ($st$) command results in a communication 
action, which we define as a triple $(a, p, \bar{t})$, where $a \in \{sh, st\}$ and $\bar{t}$ is a set of attributes. For 
example, when the user intends to share his/her home address with 
Alice, the following communication action has to be performed: 
$(sh, Alice, \{\text{home address}\})$. Thus, all possible communication 
actions can be defined as 


$Act = \{sh, st\} \times P \times (\mathcal{P}(T) \setminus \{\emptyset\})$ 


Based on the entities defined so far, the user’s behavior model 
could be defined by a transition system where each state represents 
the information shared with the agents. Further, each transition is 
triggered by the communication action performed by the user. 


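DEFINITION 1. (The User Information Sharing Model (UISM)) 
Let $UISM = (K, Act, \rightarrow, \kappa_0)$ be a 4-tuple transition system where: 


• $K$ is a finite set of knowledge states $\kappa$. 

• $\kappa_0 \in K$ is the initial state, $\kappa_0 = \emptyset$ (no initial disclosures). 

• $Act$ is a set of communication actions. 

• $\rightarrow \subseteq K \times Act \times K$ is a transition relation that transforms the system 
state with actions $(a, p, \bar{t})$ as follows: 

- $\kappa \xrightarrow{(sh, p, \bar{t})} \kappa'$, where $\kappa' \triangleq \kappa \cup \{(p, \bar{t})\}$, 

- $\kappa \xrightarrow{(st, p, \bar{t})} \kappa'$, where $\kappa' \triangleq \kappa \setminus \{(p, \bar{t}') \mid \bar{t} \cap \bar{t}' \neq \emptyset\}$. 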

It is important to note that the proposed model differentiates 
between sequentially and simultaneously sharing $t_1$ and $t_2$ with 
$p$. Sequential sharing results in $\kappa_1 = \{(p, \{t_1\}), (p, \{t_2\})\}$, while 
simultaneous sharing results in $\kappa_2 = \{(p, \{t_1, t_2\})\}$. In $\kappa_1$, if 
the action $(sh, p, \{t_1, t_2\})$ occurs, $(p, \{t_1, t_2\})$ is added to the new 
knowledge set. The resulting state contains all three tuples: $\kappa_3 = 
\{(p, \{t_1\}), (p, \{t_2\}), (p, \{t_1, t_2\})\}$. On the other hand, performing 
the stop command $(st, p, \{t_2\})$ on $\kappa_3$ deletes all 
the tuples that contain $t_2$, resulting in $\kappa' = \{(p, \{t_1\})\}$. 
For the sequential information sharing model, consider a 
scenario where the user first shares his "GPS" information with Alice, 
second shares his "home address" with her, and third shares his 
billing information, which is a combination of {home address, credit 
card number}, with Alice. If the communication action of stop sharing 
"home address" with Alice occurs, then all the tuples that contain 
"home address", like the billing information, will be removed from the 
state. 
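
The share and stop transitions of Definition 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `share` and `stop` and the representation of a knowledge state as a frozenset of (agent, attribute-set) tuples are assumptions made here, not the paper's Java implementation.

```python
# Illustrative sketch of the UISM transitions (names are hypothetical).
# A knowledge state is a frozenset of (agent, frozenset_of_attributes) tuples.

def share(state, agent, attrs):
    """Apply (sh, p, t): add the tuple (p, t) to the knowledge state."""
    attrs = frozenset(attrs)
    if not attrs:
        raise ValueError("a communication action needs at least one attribute")
    return frozenset(state | {(agent, attrs)})


def stop(state, agent, attrs):
    """Apply (st, p, t): remove every tuple of p whose attributes overlap t."""
    attrs = frozenset(attrs)
    if not attrs:
        raise ValueError("a communication action needs at least one attribute")
    return frozenset(
        (p, t) for (p, t) in state if p != agent or not (t & attrs)
    )


# The sequential scenario from the text: GPS, then home address, then billing.
k = frozenset()  # initial state: no disclosures
k = share(k, "Alice", {"GPS"})
k = share(k, "Alice", {"home address"})
k = share(k, "Alice", {"home address", "credit card number"})
# Stop sharing "home address": the billing tuple is removed as well.
k = stop(k, "Alice", {"home address"})
```

After the final stop action only the GPS tuple remains for Alice, which matches the behavior described above for tuples that contain "home address".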


3.2 Privacy-Preserving Model (PPM) 


The Privacy-Preserving Model is designed to manage and 
govern the user's information sharing activities at run-time. Therefore, 
based on the UISM proposed in the previous section, the PPM 
is required to govern the transitions between knowledge states 
according to the norms that the user specifies. 

Since in a user-centric approach it is inefficient to define a separate 
privacy norm for each $\rho$ (role) and $\tau$ (attribute type), the proposed 
model abstracts these two elements. This abstraction allows the user to apply 
the same information disclosure norms to a set of agents or to 
disclose a collection of attributes in a similar manner. For example, 
the user could share her current location with all transportation 
applications, or the user could share her credit and debit card 
numbers with her close family members. The following section 
describes the structure of the abstractions. 


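3.2.1 Abstractions and Conditions. Let $\mathcal{T}$ be a set of attribute types 
and let $AT$ be a partial map $AT : \mathcal{P}(T) \rightharpoonup \mathcal{T}$. That is, $AT$ maps $\bar{t}$ 
to an attribute type $\tau \in \mathcal{T}$. We can impose a partial order $\preceq$ on $\mathcal{T}$ 
based on the subset relation between $AT$'s domain elements $\bar{t}$. We 
say that $\tau_1 \preceq \tau_2$ if there exist $\bar{t}_1$ and $\bar{t}_2$ such that $AT(\bar{t}_1) = \tau_1$, 
$AT(\bar{t}_2) = \tau_2$, and $\bar{t}_1 \subseteq \bar{t}_2$. 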

Figures 1a and 1b show an example of the hierarchy structure 
and some attributes and attribute types in that structure. The dashed 
lines represent the mapping between an attribute and its type, and 
the solid lines depict the order relation between the attributes and 
types. 

Similar to [8], which defines the concept of role abstraction, we 
define a set of agent roles $R$ that can be assigned to an agent $p$. 
An agent can be assigned multiple roles, and roles are partially 
ordered based on the implication relation of their semantics. 

In this paper, the partial order $\preceq$ on $R$ is predefined as an input to 
the model, such that the role $\rho_1$, "close friend", implies the role $\rho_2$, 
"friend", i.e., $\rho_2 \preceq \rho_1$. The order between roles reflects their 
relative privacy restriction, where $\rho_2 \preceq \rho_1$ means that 
$\rho_2$ is more restrictive compared to $\rho_1$. 

In this approach each agent must be associated with at least one 
role. Thus, we define the agent role as a function $AR$ that maps an 
agent to a nonempty set of roles: $AR : P \rightarrow \mathcal{P}(R) \setminus \{\emptyset\}$. When a role 
$\rho$ is assigned to an agent $p$, the system adds the additional roles 
that are related to $\rho$ through $\preceq$. In other words, the set of roles for $p$ 
should be closed under $\preceq$. For example, if the agent $p$ is assigned 
the role "close friend" $\rho_1$, then the system adds the "friend" role $\rho_2$ to 
$p$ as well, resulting in $AR(p) = \{\rho_1, \rho_2\}$. 
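
The closure of an agent's role set can be illustrated with a short Python sketch. It assumes that the predefined order is supplied as a dictionary mapping each role to the roles it directly implies; the names `close_roles` and `IMPLIES` are hypothetical and serve only this example.

```python
# Illustrative sketch: closing an agent's role set under the role order.
# IMPLIES is a hypothetical input: role -> roles it directly implies.
IMPLIES = {"close friend": {"friend"}}


def close_roles(assigned, implies):
    """Return the assigned roles plus every role they (transitively) imply."""
    closed = set(assigned)
    frontier = list(assigned)
    while frontier:
        role = frontier.pop()
        for implied in implies.get(role, ()):
            if implied not in closed:
                closed.add(implied)
                frontier.append(implied)
    if not closed:
        raise ValueError("each agent must have at least one role")
    return frozenset(closed)


# AR maps each agent to a nonempty, closed set of roles.
AR = {"Alice": close_roles({"close friend"}, IMPLIES)}
```

Here Alice is assigned only "close friend", and the closure adds "friend" to her role set, as in the example above.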

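For brevity, to denote roles and information attribute types that have 
a common child but are not in a partial relation with each other, we 
use the $\langle child \rangle$ notation as follows: 

(1) $\rho_1 \langle p \rangle \rho_2 \triangleq \exists p \in P : \rho_1 \in AR(p) \wedge \rho_2 \in AR(p) \wedge \rho_1 \npreceq \rho_2 \wedge \rho_2 \npreceq \rho_1$ 

(2) $\tau_1 \langle \bar{t} \rangle \tau_2 \triangleq \exists \bar{t} \in \mathcal{P}(T) : AT(\bar{t}) \preceq \tau_1 \wedge AT(\bar{t}) \preceq \tau_2 \wedge \tau_1 \npreceq \tau_2 \wedge \tau_2 \npreceq \tau_1$ 

Using these abstractions the user can define access permissions 
$A$ as a subset of $R \times \mathcal{T}$ such that if an element $(\rho, \tau) \in A$ then all 
agents with role $\rho$ are allowed to access attributes with type $\tau$. 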
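
As an illustration of the attribute abstraction, the following Python sketch derives the order on attribute types from a partial map and checks an access permission. The concrete mapping `AT` below is a hypothetical example loosely based on Figure 1, and the exact-type matching in `allowed` is a simplifying assumption, not the paper's verification procedure.

```python
# Illustrative sketch (hypothetical data): a partial map from attribute sets
# to attribute types, and the order it induces on the types.
AT = {
    frozenset({"GPS"}): "Location",
    frozenset({"home address"}): "Location",
    frozenset({"home address", "credit card number"}): "Finance",
}


def type_leq(tau1, tau2):
    """tau1 is below tau2 if some set of type tau1 is contained in one of type tau2."""
    return any(
        a <= b
        for a, ta in AT.items() if ta == tau1
        for b, tb in AT.items() if tb == tau2
    )


def allowed(roles, attrs, permissions):
    """Check whether some (role, type) permission covers the shared attributes."""
    tau = AT.get(frozenset(attrs))  # AT is partial: unmapped sets have no type
    return tau is not None and any((r, tau) in permissions for r in roles)


A = {("friend", "Location")}  # access permissions as (role, type) pairs
```

With this data, `type_leq("Location", "Finance")` holds because {home address} is contained in the billing attribute set, while the reverse does not hold.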
The above abstractions of roles and information attributes provide 
better flexibility in defining privacy norms. However, this definition 
is not complete yet, as it does not take into consideration the 
environmental conditions under which the information is disclosed to the 
recipients, and it is not sensitive to the patterns and sequence of 
the information disclosure. Imagine that the user is interested in restricting 
the access of agents in role $\rho$ to attribute type $\tau$ to a particular time 
interval during a work day. Moreover, the user might allow only 
up to two $(\rho, \tau)$ accesses per such interval. 

To overcome this limitation, our formalism introduces 
logic for environmental conditions $\psi$ and temporal conditions $\varphi$ to 
the definition of the privacy norm. In this model, environmental 
conditions are represented by a set of variables $V$, where each $v \in V$ 
describes the state of the environment, such as the system's time, day, and 
other attributes. $V$ is partitioned into subsets $V_i$ by the variables' 
types, such as integers, booleans, reals, and so on. It is assumed that each 
type has a set of predicates $Pred_i$ and a set of syntax rules to construct 
such predicates from the variables and non-logical symbols, e.g., 
constants. An environmental condition ($\psi$) is then expressed in 
propositional logic over those predicates and variables, i.e., $v \in V_i$, 
$pred_i \in Pred_i$, as follows: 


$\psi ::= \neg\psi \mid \psi \wedge \psi \mid \psi \vee \psi \mid pred_i, \quad \forall V_i \subseteq V$ 


While $Pred_i$ could be produced by an arbitrarily complex yet 
decidable theory for the data type, such as Presburger arithmetic for 
integers, we argue that less complex theories could be adequate [3]. 
For example, for integer environmental variables $V_I$ and boolean 
environmental variables $V_B$, the following grammar could be sufficient 
to express basic and easily comprehensible predicates $pred_i$: 


$pred_I ::= v < n \mid v \le n \mid v == n, \quad v \in V_I, n \in \mathbb{Z}$ 
$pred_B ::= v \mid true \mid false, \quad v \in V_B$ 
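
To make the grammar concrete, the following Python sketch evaluates an environmental condition against a snapshot of the environment. The tuple encoding of conditions, the operator tags, and the variable names `hour` and `workday` are assumptions made for this example; negation is included on the assumption that it is available, as in ordinary propositional logic.

```python
# Illustrative sketch: conditions are nested tuples, the environment is a dict.
def holds(cond, env):
    """Evaluate a condition tree against a dict of environment variables."""
    op = cond[0]
    if op == "not":
        return not holds(cond[1], env)
    if op == "and":
        return holds(cond[1], env) and holds(cond[2], env)
    if op == "or":
        return holds(cond[1], env) or holds(cond[2], env)
    if op == "lt":       # integer predicate v < n
        return env[cond[1]] < cond[2]
    if op == "le":       # integer predicate v <= n
        return env[cond[1]] <= cond[2]
    if op == "eq":       # integer predicate v == n
        return env[cond[1]] == cond[2]
    if op == "var":      # boolean variable
        return bool(env[cond[1]])
    if op == "const":    # true / false
        return bool(cond[1])
    raise ValueError(f"unknown operator: {op!r}")


# "Between 9:00 and 17:00 on a work day" (hypothetical variables).
WORK_HOURS = (
    "and",
    ("and", ("not", ("lt", "hour", 9)), ("lt", "hour", 17)),
    ("var", "workday"),
)
```

For instance, `holds(WORK_HOURS, {"hour": 10, "workday": True})` evaluates the condition for a snapshot taken at 10:00 on a work day.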


The next entity that is defined as part of the privacy norm is the 
temporal condition $\varphi$. To keep the conditions flexible and 
generic, we utilize temporal logic expressions to describe 
temporal features of the privacy requirements. While Linear Temporal 
Logic (LTL) is very popular for expressing a broad range of liveness 
conditions, LTL formulas are difficult to read and understand. Utilizing LTL 
requires a strong mathematical background and is cumbersome for 
an average system modeler. Further, for the purpose 
of defining temporal conditions in a privacy norm, a simplified 
grammar suffices: the precedence of two communication 
actions or a constant number of occurrences of a communication action can be 
sufficiently defined by the concatenation and Kleene star operations 
over $A$ (the alphabet): 


$\varphi, \varphi' ::= (\rho, \tau) \mid \varphi \cdot \varphi' \mid \varphi^{*}, \quad (\rho, \tau) \in A$ 


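The $\Phi$ notation is used to represent the set of such $\varphi$, in which each $\varphi$ for 
a given role $\rho$ can be expressed as a regular expression. For example, the 
following expression allows sharing attributes of type $\tau_2$ after the sharing 
of attributes of type $\tau_1$: 


$\varphi = A_1^{*} \cdot ((\rho, \tau_1) \cdot A_1^{*} \cdot (\rho, \tau_2))^{*} \cdot A_1^{*}$ 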
Figure 1: (a) An example of the partial order of the attributes and attribute types, where the top layer shows the attribute types 
and the bottom layer shows the information attributes themselves. (b) $t_1$ = GPS information, $t_2$ = home address, and $t_3$ = credit card 
number. The middle layer represents information attributes that are used together; for example, the credit card number and the 
home address go together as billing information, which is considered a financial type. 


Here $A_1 = A \setminus \{(\rho, \tau_1), (\rho, \tau_2)\}$. In addition, the repetition of an 
event up to a constant $k$ times could be expressed with the following 
formula, where the power operator describes the number of times 
a regular expression should be repeated: 


$\varphi = A_2^{*} \cdot ((\rho, \tau) \cdot A_2^{*})^{k}$ 
where $A_2 = A \setminus \{(\rho, \tau)\}$. 
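
Both kinds of temporal condition can be tried out with Python's `re` module by encoding each access as a single character. This is a sketch under stated assumptions: the symbol table `SYMBOLS` is hypothetical, every other access is collapsed to the character `x`, the ordering pattern follows the expression for $\tau_2$ after $\tau_1$ as reconstructed here, and the bounded pattern uses the `{0,k}` quantifier to express "up to $k$ times" rather than exactly $k$.

```python
import re

# Hypothetical encoding: one character per tracked (role, type) access.
SYMBOLS = {("Fr", "Loc"): "a", ("Fr", "Fin"): "b"}


def encode(trace):
    """Map each (role, type) access to one character; untracked pairs become 'x'."""
    return "".join(SYMBOLS.get(step, "x") for step in trace)


# Ordering: every tracked (Fr, Fin) access is preceded by a (Fr, Loc) access.
ORDERED = re.compile(r"[^ab]*(?:a[^ab]*b)*[^ab]*")


def at_most(k):
    """Bounded repetition: at most k occurrences of the (Fr, Loc) access."""
    return re.compile(r"[^a]*(?:a[^a]*){0,%d}" % k)


def accepts(pattern, trace):
    """A trace satisfies the condition if the whole encoded string matches."""
    return pattern.fullmatch(encode(trace)) is not None
```

A trace such as `[("Fr", "Loc"), ("CoWr", "Loc"), ("Fr", "Fin")]` is accepted by `ORDERED`, whereas a trace that starts with `("Fr", "Fin")` is rejected.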


Now that we have defined each element of the privacy norm, the 
next section describes the formal specification of the privacy norm 
and techniques to ensure the consistency of the privacy 
requirements. 


3.2.2 Norms and their Consistency. In this research, norms are the 
formal definition of the user's privacy requirements that are used to 
govern the user's information sharing behavior. To minimize 
the risk of unwanted information sharing, we assume that if an 
action is not explicitly defined as part of the user's privacy policies 
then it is forbidden. Therefore, the only type of norms that the user 
defines are positive norms, i.e., allowed norms. 

In this context a norm is formulated as a relation between access 
permission, environmental, and temporal conditions. Hence, a norm 
is represented as a tuple $((\rho, \tau), \psi, \varphi)$, where $(\rho, \tau) \in A$, $\psi \in \Psi$, and 
$\varphi \in \Phi$. The first element of the tuple represents the privacy policy, 
while the second and the third elements of the tuple describe the 
conditions under which the transfer of information should occur. 
The set of such tuples is referred to as the set of norms $N$. 

The set $N$ has the uniqueness property, that is, only one tuple with 
the given $(\rho, \tau)$ values is allowed in the set. However, the uniqueness 
property is not sufficient to ensure the consistency of the privacy 
norms due to the partial relations that exist among the roles and 
attribute types. Thus, to use the norms for privacy management 
and detection of information disclosure, a consistency check is 
required. Table 1 gives a detailed explanation, with 
examples, of the different possible cases of roles and attribute types 
that two norms can have during consistency checking. The row 
headers show the roles and the column headers show the attribute 
types. The cells in gray are examples of the conditions above them. 
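
The default-deny reading of positive norms and the uniqueness property can be illustrated with a small Python sketch. All names here are hypothetical: norms are stored in a dictionary keyed by (role, type), which enforces uniqueness by construction; the environmental condition is a plain Python predicate; and the temporal condition is a compiled regular expression over a one-character-per-access history. This is an assumption-laden illustration, not the paper's verification engine.

```python
import re

# Hypothetical norm set: (role, type) -> (env predicate, temporal regex, symbol).
# Keying by (role, type) permits at most one norm per pair (uniqueness).
NORMS = {
    ("friend", "Location"): (
        lambda env: 9 <= env["hour"] < 17,      # environmental condition
        re.compile(r"[^L]*(?:L[^L]*){0,2}"),    # at most two Location accesses
        "L",                                    # symbol appended to the history
    ),
}


def permitted(norms, roles, attr_type, env, history):
    """Default deny: allow only if a norm for some held role is satisfied."""
    for role in roles:
        norm = norms.get((role, attr_type))
        if norm is None:
            continue  # no positive norm for this pair, so nothing is allowed
        psi, phi, symbol = norm
        if psi(env) and phi.fullmatch(history + symbol):
            return True
    return False
```

With these norms, a friend may receive a Location attribute at 10:00 when fewer than two such accesses are recorded in the history, and any pair without a norm is rejected.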


DEFINITION 2. (Consistent Norms) Two norms $n_1 = ((\rho_1, \tau_1), \psi_1, \varphi_1)$ 
and $n_2 = ((\rho_2, \tau_2), \psi_2, \varphi_2)$ are consistent when one of the four 
consistency conditions holds: 


C1. $\nexists p \in P : \rho_1 \in AR(p) \wedge \rho_2 \in AR(p)$, that is, the norms are defined 
for roles with no common agents. (Table 1 row G) 


C2. $\nexists \bar{t} \in \mathcal{P}(T) : AT(\bar{t}) \preceq \tau_1 \wedge AT(\bar{t}) \preceq \tau_2$, that is, the norms are 
defined for attribute types with no common information attribute. (Table 
1 column 5) 

Before defining the last two conditions of consistency, we 
propose some limitations over the access permission and sequencing 
conditions of the privacy norms. Since both of these elements are 
defined for specific role and attribute type parameters, the first 
restriction is defined over the roles: the same role should 
be used in the access permission and the sequencing condition 
of a norm. In the absence of this restriction, it is possible to 
create two norms that have consistent sequencing conditions but 
inconsistent access permissions, or vice versa. In addition, this 
restriction enforces a constant role across the regular expression of 
the sequencing condition, which reduces the regular expression's 
complexity by eliminating the need for a homomorphic function over 
the roles. The second restriction is defined over the attribute types, 
$\forall \tau_i, \tau_j \in \varphi,\ 0 < i, j \le n : \tau_i \nprec \tau_j$ (an attribute type and its children 
Table 1: The possible consistency cases based on the relations between roles and information attribute types, and the constraints 
over the conditions that result in consistency. Notation: Fr = Friends, BFr = Best Friends, CoWr = Co-Workers, Fml = Family, 
Loc = Location, Fin = Finance, Hlth = Health, and Bank = Banking information. 


1 


2 


3 


4 


5 


L(s2) € £(s) 


L(s1) € L(s2) 


L(s1) = L£(s2) 


T < T2 T) <T T = 72 tT <e> Tj < none > T2 
Loc < Fin Loc < Fin Loc = Loc Fin < Loc > HLth Loc < none > Bank 
c2 € c C2 = c 2 — 0 C2, —— Ch 

sd ria L£(s1) = £2) L61) € £2) Lisi) € £2) £i) € Le) DM 

B | Fr « BFr Share Loc with Fr when c1 an | Share Fin with Fr when c1 and | Share Loc with Fr when c1 and | Share Fin with Fr and Health | Since Loc and Bank are in- 
s1, share Fin with BFr when c2 | s1, share Loc with BFr when c2 | s1, sare Loc with Bfr when c2 | with BFr (or vice versa) which | comparable then those norms 
and s2. Fin should be guarded | and s2. Fin should be guarded | and s2. Loc should be guarded | can share Loc. Loc should be | should always be consistent. 
the same or better, c; => c», | the same or better, c => cj, | at least the same way, c; €» | guarded at least the same way 
(sg) € L(s1). BFr can have | L(s1) € L(sz). BFr can have | c2, L(s1) = L(sz). BFr can | c €» cz, £(s1) = (sz). BFr 
less restrictive access, cp ==> | less restrictive access cp => | have less restrictive conditions, | can have less restrictive condi- 
c1, £(81) € £(s2) c1, L(s1) € L(s2) €? = c1, £(s1) € £(s2) tion, c => c, L(s1) € £(s2) 

C E ey c9 c? — 06 Fal c) € 0 P 

EN L£(s2) € £(s1) Lisi) € £(s2) n Lln) = £2) "A 

D | Fr=Fr Share Loc with Fr when c1 and | Share Fin with Fr when c1 and | There should be only one rule | Share Fin with Fr when c1 and | Since Loc and Bank are in- 
s1, share Fin with Fr when c1 | s1, share Loc with Frien when | for the same role and attribute | s1, share Health with Fr when | comparable then those norms 
and s2. Fin should be guarded | c2and s1. Fin should be guarded | type - the uniqueness property | c2 and s2, which can share the | should always be consistent. 
the same or better way c} => | the same or better way, c. => same attribute Loc. Loc should 
c2, L(s2) € L(s1). Fr should | c1, L(si) € L(s2). Fr should be guarded at least the same 
have at least the same access, | have at least the same access way c € cs, £(si) = L(s2). 
c1 € co, L(s1) = £(s2). cy € co, L(s1) = L(s2) Fr should have the same access 

c1 € c2, £(s1) = £52) 
ES xp py ca = 02 c2 — 0 c2 e c c2 ec True 


L(s1) = £(s2) 


F | Fr Anna CoWr 


Share Loc with Fr when c1 and 
s1, share Fin with CoWr when 
c2 and s2, which have Anna as 
a common agent. Fin should be 
guarded the same or better way 
cy = > cg, L(s2) € L(s1). Fr 
and CoWrk should have at least 
the same access to Loc c1 © c2, 
L£(s2) = £(s1), since they share 
an agent. 


Share Fin with Fr when c1 and 
s1, share Loc with CoWrk when 
c2 and s2, which have Anna a 
common agent. Fin should be 
guarded better than Loc cg => 
cy, L(si) €  £(s2). Fr and 
CoWrk should have at least the 
same access to Loc cz © ci, 
L(s1) = L(sz), since they share 
an agent. 


Share Loc with Fr when cl 
and s1, share Loc with CoWrk, 
when c1 and s2, which have 
Anna as a common agent. Loc 
should be guarded the same 
way c € c, L(s1) = L(s2). 
Fr and Cowrk should have the 
least the same access to Loc, 
€ © c5, L(s1) = L(s2), since 
they share an agent. 


Share Fin with Fr when cl and 
sl, share Health with CoWrk 
when c2 and s2, which have 
Anna as a common agent. Loc 
should be guarded at least the 
same way c © cs, £(s1) = 
£(s2). Fr and CoWrk should 
have the same access to Loc 
cı € c2, L(s1) = £(s2), since 
they share an agent. 


Since Loc and Bank are in- 
comparable then those norms 
should always be consistent. 


G | $\rho_1 \langle none \rangle \rho_2$ | True | True | True | True | True 


H | Fr $\langle none \rangle$ Fml | Columns 1-5: Since Fr and Fml are incomparable, those norms should always be consistent. 

are not allowed to exist in the same regular expression). This restriction ensures that all the communication actions are inspected not only for the super-type $\tau$ that is explicitly inferred from the communication action, but also for all the children of $\tau$ that will be implicitly revealed by that communication action. Without this restriction, it is possible to create a regular expression that allows sharing an attribute type and its children consecutively without taking into account that the children are shared more than once.

Further, the comparisons of the access permission components of the norms are conducted based on the partial relations that exist over the roles and attribute types. In addition, the comparison between the environmental conditions is implemented based on Boolean algebra. To examine the sequencing conditions for consistency, we need to compare the regular expressions. The comparison of two regular expressions is not possible if they do not share the same alphabet. Therefore, we need to introduce a mechanism that projects the language of one regular expression onto the other and brings the regular expressions to a common alphabet.


DEFINITION 3. (Projection of the Language) Let $\varphi_1$ and $\varphi_2$ have the following symbols to be tracked:

$$\varphi_1 = \{(\rho, \tau_1), (\rho, \tau_2), \ldots, (\rho, \tau_k)\}$$

$$\varphi_2 = \{(\rho', \tau_1), (\rho', \tau_2), \ldots, (\rho', \tau_n)\}$$

We define $\mathcal{L}(\varphi_1)_{\varphi_2}$ as the projection of $\varphi_1$ on $\varphi_2$, where $\mathcal{L}$ receives a regular expression and maps it to another one. To obtain a similar language for comparing $\varphi_1$ and $\varphi_2$, we traverse the attribute types. For each attribute type, we check the other regular expression for its children, or for another attribute type that has a common child, and add the children or the common child to a set in a map. After traversing all the attribute types in both $\varphi_1$ and $\varphi_2$, we substitute the uncommon parts by generating all the possible substitutions for each attribute type $\tau_j$ that exists in the map. The substitution for $\tau_j$ that reaches a common language is a disjunctive regular expression, generated as follows. Let $sub$ be the set of all children of $\tau_j$ and common children that have been found in the other regular expression. We take the power set of $sub$ without the empty set, $\mathcal{P}(sub) \setminus \{\emptyset\}$. For each $s$ in this collection we generate all the permutations of the elements of $s$ and add them to the regular expression with the disjunction operator. For example, if $sub = \{\tau_a, \tau_b\}$ then $\mathcal{P}(sub) \setminus \{\emptyset\} = \{\{\tau_a\}, \{\tau_b\}, \{\tau_a, \tau_b\}\}$ and the regular expression used for substitution is $\tau_a|\tau_b|\tau_a\tau_b|\tau_b\tau_a$. After reaching the same alphabet, the consistency of the regular expressions can be decided based on the norms' access permission.
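The substitution step of Definition 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `substitution_regex` is hypothetical, and each child attribute type is represented by a single-character symbol (`a` and `b` stand in for $\tau_a$ and $\tau_b$) so that concatenation is unambiguous.

```python
from itertools import combinations, permutations


def substitution_regex(children):
    """Build the disjunctive regular expression that replaces an attribute type.

    `children` lists the child (or common-child) symbols found for it
    in the other regular expression.
    """
    alternatives = []
    # Walk the power set without the empty set, smallest subsets first.
    for size in range(1, len(children) + 1):
        for subset in combinations(children, size):
            # Every ordering of the subset becomes one alternative.
            for ordering in permutations(subset):
                alternatives.append("".join(ordering))
    return "|".join(alternatives)


print(substitution_regex(["a", "b"]))  # a|b|ab|ba
```

For two children the result matches the example in the definition. The number of alternatives grows quickly with the number of children, which is why an implementation would likely cap that number.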


C3. If $\rho_1 < \rho_2$ and either $\tau_1 < \tau_2$ or $\tau_2 < \tau_1$, then $\psi_1 \Rightarrow \psi_2 \wedge \mathcal{L}(\varphi_1)_{\varphi_2} \subseteq \mathcal{L}(\varphi_2)_{\varphi_1}$. That is, if $n_2$ is for a specialized role $\rho_2$ of $\rho_1$ and its attribute type $\tau_2$ encompasses $\tau_1$ or vice versa, then the environmental condition $\psi_2$ should be the same as or less restrictive than $\psi_1$, and its regular expression $\varphi_2$ should describe the same or a less restricted projected language than $\varphi_1$ (Table 1, rows A and C, columns 1, 2 and 3).

C4. If $\rho_1 \leftarrow \rho \rightarrow \rho_2$ or $\tau_1 \leftarrow \tau \rightarrow \tau_2$, then $\psi_1 \Leftrightarrow \psi_2 \wedge \mathcal{L}(\varphi_1)_{\varphi_2} = \mathcal{L}(\varphi_2)_{\varphi_1}$. If there is at least one agent that can be assigned to both unrelated roles, or there are information attributes that share a common child, then the environmental conditions and the projected languages of the regular expressions must be equivalent (Table 1, row E, column 4).


3.2.3 Policy Compliance Verification. The set of norms $N$ defines a Privacy-Preserving Model (PPM), which describes compliant information communication actions at the abstraction level of attribute types and agent roles. The knowledge states of the PPM consist of tuples $(\rho, \tau)$, which indicate that at least one agent with role $\rho$ knows about the attribute represented by $\tau$. The transitions represent the abstracted communication actions $Act$ from $\{sh, st\} \times R \times T$, guarded by the conditions $\Phi$ and $\Psi$ defined in $N$.


DEFINITION 4. (Privacy-Preserving Model) A Privacy-Preserving Model is a set of observers over the norms $N$, where each observer is a tuple $(K, Act, c, m)$ representing $n_j = ((\rho, \tau), \psi, \varphi) \in N$, where $K = (\rho, \tau)$, $c = \psi$ is the pre-condition, and $m$ is a monitor representing the regular expression $\varphi$. The transition $Act$ is given to the monitor $m$ to update the state of the monitor.


3.2.4 Verification. To ensure that the user's behavior is compliant with the privacy policy, we need to map the current state and the next state of the user's behavior model to the privacy-preserving behavior model.


DEFINITION 5. (Mapping from user behavior to privacy-preserving domain) Let $MS : K \rightarrow K$ be a surjective function, where $MS(p, t) = \{(\rho, \tau) \mid \rho = AR(p), \tau = AT(t)\}$, and $MT : Act \rightarrow Act$, where:

$$MT(a, p, t) = \{(\rho, \tau) \mid \rho \in AR(p) \wedge \tau \in AT(t)\} \quad \text{if } a = sh$$


In the case that there is no mapping for the next state in the 
PPM, the communication action that triggered that transition will 
be reported to the user as disclosing. 


DEFINITION 6. (Valid user behavior) Let the user behavior system be at state k that maps to k in the privacy-preserving behavior model, and let the action (sh, p, t) happen. If MP(p, t) exists, the environmental variables satisfy $\psi$, and m(MS(a, p, t)) is in the final state, then the communication action Act is valid.


The goal of privacy rules is to prevent the user from entering a privacy-violating state.

After a privacy-violating action is reported, the user can ignore the report and the framework allows the information sharing to happen. All this communication happens through the user interface of the framework. The next section provides implementation details of the framework's components.


4 IMPLEMENTATION 


As a proof of concept, we prototyped the proposed framework in the Java programming language. Figure 2 depicts a diagram of the implementation's architecture. The blue components show the libraries and technologies used in the proposed framework. The proposed framework is modularized into three layers: (1) a user interface layer, which takes the user's intentions in a structured format, (2) a translation layer, which translates the structured input from the UI into privacy norms and formal notation, and (3) a verification layer, which evaluates the consistency of the norms and the compliance of information sharing actions with the privacy norms. The following sections describe the implementation details of the components in each layer.

Figure 2: The architecture of the user-centric privacy framework.


4.1 User Interface Layer 


The user interface (UI) layer facilitates interactions between the 
user and the proposed framework. Through the UI the user can 
add and view the existing privacy norms and get privacy violation 
reports. The UI is designed to conceal the complexity of the un- 
derlying formalism and verification from the user. The UI hides 
the complexities by allowing the users to express their privacy 
intentions as a structured input. Using the UI the user can select 
the role and attribute type from a drop-down list. To create the 
environmental conditions, the user can provide arbitrary inputs for 
environmental variables or choose between predefined conditions 
(e.g., daytime, nighttime, weekends). Also, the user can specify the desired information sequence in the form of precedence or repetition templates such as "X happens after Y" or "X happens k times". These templates are translated into sequencing conditions.


4.2 Translation Layer 


The translation layer receives the structured input from the UI and translates it into formal notation. The formal notations and maps described in the methodology section can be implemented as tables in a database. The norms are stored in the norms table, whose attributes are the role, the attribute type, the environmental conditions, and the DFA state of the sequencing conditions. The primary key of the norms table is the pair $(\rho, \tau)$. The system queries the database to retrieve the norms in order to either verify an action or check the consistency of a new norm. To evaluate an action with attribute $t$ and agent $p$, the norms whose role is $\rho = AR(p)$ and whose attribute type is $\tau = AT(t)$ are retrieved from the norms table and sent to the verification layer.
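The norms table and its lookup can be sketched as follows. This is a minimal illustration, not the paper's schema: it uses Python's built-in `sqlite3` as a stand-in for the database, and the column names, the helper names `create_norms_table` and `get_norms`, and the sample rows are all assumptions made for the example.

```python
import sqlite3


def create_norms_table(conn):
    """Create a norms table keyed by the (role, attribute type) pair."""
    conn.execute(
        """CREATE TABLE norms (
               role           TEXT NOT NULL,
               attribute_type TEXT NOT NULL,
               env_condition  TEXT,
               seq_regex      TEXT,
               dfa_state      INTEGER DEFAULT 0,
               PRIMARY KEY (role, attribute_type)
           )"""
    )


def get_norms(conn, roles, types):
    """Fetch the norms for every role of the agent and every type of the attribute."""
    if not roles or not types:
        return []
    role_marks = ",".join("?" * len(roles))
    type_marks = ",".join("?" * len(types))
    sql = (
        "SELECT role, attribute_type, env_condition, seq_regex FROM norms "
        f"WHERE role IN ({role_marks}) AND attribute_type IN ({type_marks}) "
        "ORDER BY role, attribute_type"
    )
    return conn.execute(sql, [*roles, *types]).fetchall()


conn = sqlite3.connect(":memory:")
create_norms_table(conn)
conn.executemany(
    "INSERT INTO norms (role, attribute_type, env_condition, seq_regex) "
    "VALUES (?, ?, ?, ?)",
    [("Fr", "Loc", "hour >= 9 and hour < 18", "f(l|f)*"),
     ("CoWrk", "Fin", "weekday", "f*")],
)
print(get_norms(conn, ["Fr"], ["Loc", "Fin"]))
```

Because the primary key is the pair of role and attribute type, inserting a second norm with the same access permission fails, which mirrors the first consistency check described for new norms.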


4.3 Verification Layer 


This layer verifies the compliance of information sharing actions with the privacy norms and the consistency of a new norm with the existing norms. If an information sharing action violates the privacy norms or a new norm causes an inconsistency, then this layer sends a violation report to the UI to inform the user. The user can ignore the violation
caused by the information sharing action and allow the information 
to be shared. With an inconsistent norm, the user has to change 
the new norm so that it will be consistent with other norms. The 
rest of this section describes the verification method of information 
sharing actions and privacy norms in more detail. 


4.3.1 Verification of Norms for Inconsistency. When a new norm is created, the framework checks the consistency of the new norm with the existing norms. Based on the consistency constraints in Section 3.2.2, the framework first ensures that the access permission of the new norm does not already exist in the database. Then the new norm's environmental conditions are checked for consistency. The framework parses the string of the environmental conditions and converts it to SMT solver formulas. The SMT solver then needs to prove that the implication or equivalence relation always holds, that is, that it is valid. SMT solvers assess the validity of a formula $f$ by proving that $\neg f$ is unsatisfiable: if no combination of variable values satisfies $\neg f$, then $f$ always evaluates to true and is a tautology. If the solver finds a solution to $\neg f$, the user is asked to change the inconsistent new norm. Further, since efficiency is important in real-time systems, we assign a time limit to the solver. If the solver times out or returns UNKNOWN, the user is notified. Finally, if the norm is consistent, it is added to the database. The implementation of the proposed framework uses JavaSMT [23] with the Z3 solver version 4.3.2 [13] for consistency checking over the environmental variables and the "brics" library version 1.12-1 [31] for the sequencing conditions.
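The idea of proving validity by failing to satisfy the negation can be illustrated without a solver. The sketch below is a brute-force stand-in for the SMT check, not the JavaSMT/Z3 code of the framework: it assumes a single environmental variable, the time of day encoded as hours×100+minutes, and enumerates its whole domain. The function names and the two sample conditions are hypothetical.

```python
def all_times():
    """Every time of day encoded as hours * 100 + minutes."""
    for hour in range(24):
        for minute in range(60):
            yield hour * 100 + minute


def find_counterexample(formula, domain):
    """Return a value that satisfies the negation of `formula`, or None."""
    for value in domain:
        if not formula(value):
            return value
    return None


def implies(cond_a, cond_b):
    """cond_a => cond_b is valid iff its negation has no satisfying value."""
    return find_counterexample(lambda t: (not cond_a(t)) or cond_b(t), all_times()) is None


def equivalent(cond_a, cond_b):
    return implies(cond_a, cond_b) and implies(cond_b, cond_a)


def morning(t):
    return 900 <= t <= 1200


def daytime(t):
    return 800 <= t <= 1800


print(implies(morning, daytime))  # True: the morning interval lies inside daytime
print(implies(daytime, morning))  # False: 08:00 is daytime but not morning
```

Enumeration only works here because the domain is tiny; a real solver reasons symbolically and may also time out or answer UNKNOWN, which is why the framework handles those outcomes separately.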


4.3.2 Verification of Actions for Violation. For each action $(sh, p, t)$, the framework finds the attribute type of $t$ and the role of $p$. Then the privacy norms table is queried to find the norms with the access permission $(AR(p), AT(t))$ as their primary key. If the query returns no results, no norm allows sharing information $t$ with agent $p$. If the query returns results, there exists a mapping from a state in the UBM to a state in the PPM. The framework then checks that the environmental conditions and sequencing conditions are satisfied before taking the transition to the mapped state.

Since the norm conditions are dynamic, they cannot be hard-coded in the verification engine. Therefore, to check the environmental variables, a mechanism is needed that enables the verification engine to handle changes in the conditions: the conditions are formed and evaluated at run-time based on the environmental constraints stored in the database. An Expression Language (EL) can be used to implement such a mechanism, which allows dynamic manipulation and evaluation of conditions. An EL receives an object and a logical expression as a string and evaluates whether the object's properties satisfy the expression. In our implementation, the current snapshot of the environment is given to the EL as the input object that holds the environmental values, and the EL expression string is the environmental constraint of the retrieved privacy norm. This framework employs the Spring Expression Language (SpEL) [1] as the EL library. The EL only checks the satisfaction of the environmental conditions; if they are not satisfied, then the transition guard is not satisfied and the action violates the privacy model. If the environmental conditions are satisfied, then we check the satisfaction of the sequencing conditions.

Sequencing conditions are implemented as run-time monitors built from the regular expressions stored in the database. A run-time monitor is a deterministic finite automaton (DFA) that is created from a regular expression. The DFA representing a sequencing condition has a pointer to its current state and changes its state with the occurrence of information sharing actions. If the new state in the DFA monitor is not a final state, then the action is not valid, and the system reports the violation to the user. Different libraries exist for creating run-time monitors, such as AspectJ, but the monitors created by them are static. Therefore, a change in one of the regular expressions demands a reset of all the monitors. In the proposed framework the regular expressions are dynamic, and changing a regular expression only causes a reset of the corresponding DFA. Another method for implementing the sequencing conditions is to store a history of information sharing actions; however, the history grows with each information sharing action, potentially without bound. With the run-time monitors, the number of DFAs is constant and equal to the number of norms with sequencing conditions. Algorithm 1 shows the general steps taken to implement the information sharing action verification process.
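A run-time monitor of this kind can be sketched as a small class that keeps only the current DFA state. The sketch is illustrative: the class name `DfaMonitor` is hypothetical, and the transition table is written by hand rather than compiled from a regular expression as the framework does with the brics library. The symbols `f` and `l` stand in for sharing a finance and a location attribute.

```python
class DfaMonitor:
    """Run-time monitor: a DFA with a pointer to its current state."""

    def __init__(self, transitions, start, finals):
        self.transitions = transitions   # {(state, symbol): next_state}
        self.start = start
        self.finals = set(finals)
        self.reset()

    def reset(self):
        """Return to the start state, e.g. after the regular expression changes."""
        self.state = self.start
        self.dead = False                # True once an undefined transition was hit

    def step(self, symbol):
        """Consume one sharing action; True means the new state is final."""
        if not self.dead:
            next_state = self.transitions.get((self.state, symbol))
            if next_state is None:
                self.dead = True
            else:
                self.state = next_state
        return not self.dead and self.state in self.finals


# Hand-built DFA for the hypothetical expression f(l|f)*:
# location ('l') may be shared only after finance ('f') has been shared.
monitor = DfaMonitor({(0, "f"): 1, (1, "f"): 1, (1, "l"): 1}, start=0, finals=[1])
print(monitor.step("l"))  # False: 'l' arrived before any 'f'
monitor.reset()
print(monitor.step("f"), monitor.step("l"))  # True True
```

The monitor stores one state value regardless of how many actions it has seen, which is the memory advantage over keeping a full action history.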


Algorithm 1: Action verification algorithm.

Input: CA (communication action)
Output: Boolean value indicating the verification result.

norms = []
roles = CA.recipient.getRoles()
types = CA.informationAttribute.getType()
for r in roles and t in types do
    norms.append(getnorms(r, t))
if (norms.size > 0) then
    for j in norms do
        if !(j.evalEnvironmentalCondition(CA.environment)) then
            return false
        else
            if !(j.evalSequencingCondition(CA)) then
                return false
    return true
return false
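The steps of Algorithm 1 can also be written as a short runnable sketch. The data shapes here are assumptions made for the example: a communication action is a plain dictionary, a norm is a pair of Python callables standing in for the EL expression and the DFA monitor, and `norms_by_key` plays the role of the norms table.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Norm:
    env_condition: Callable[[dict], bool]   # stand-in for the EL expression
    seq_condition: Callable[[str], bool]    # stand-in for the DFA monitor


def verify_action(action: dict, norms_by_key: Dict[Tuple[str, str], Norm]) -> bool:
    """Return True only if at least one norm applies and every applicable norm holds."""
    norms = []
    for role in action["recipient_roles"]:
        for attr_type in action["attribute_types"]:
            norm = norms_by_key.get((role, attr_type))
            if norm is not None:
                norms.append(norm)
    if not norms:
        return False                         # no norm allows this sharing action
    for norm in norms:
        if not norm.env_condition(action["environment"]):
            return False
        if not norm.seq_condition(action["symbol"]):
            return False
    return True


norms = {("Fr", "Loc"): Norm(lambda env: 9 <= env["hour"] < 18, lambda sym: True)}
action = {"recipient_roles": ["Fr"], "attribute_types": ["Loc"],
          "environment": {"hour": 10}, "symbol": "l"}
print(verify_action(action, norms))  # True
```

An action outside the allowed hours, or one addressed to a role with no matching norm, is rejected by the same function.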
Considering the above implementation, in the next section we discuss the performance evaluation of the proposed framework.


5 PERFORMANCE EVALUATION 


The proposed framework is designed for user-centric applications; therefore, it should have acceptable performance on smart devices such as smartphones and Internet of Things devices. The main challenge in this area is that these devices usually have low memory and computational power. Since the detection of privacy violations in such applications is supposed to be real-time, a framework with a substantial performance overhead cannot deliver the desired results. Therefore, the performance of our implementation was evaluated on a Raspberry Pi Model B with a 700 MHz CPU and 512 MB of RAM running the Raspbian 4.9 operating system, as well as on a PC with a 3.0 GHz AMD Phenom II X4 945 processor and 8 GB of memory running the Windows 7 operating system. The privacy policy created for this test contained 81 privacy norms over 12 attribute types and 16 roles, 8 of which have nonempty intersections with other groups.

Table 2 shows the results of the performance evaluation of information sharing action verification. The number in each column is the average verification response time for each part of a privacy norm. The average was computed over 20 information sharing actions, half of which were privacy-violating actions and the other half non-violating actions. Note also that the performance of action verification depends on the performance of the underlying database software and expression language library. In the implementation of our framework, we used the MariaDB version 10.2 database and SpEL 3.1.0 as the EL library.


Table 2: Action verification performance evaluation results. The columns show the response time for Access Permission (AP), Environmental Conditions (EC), and Sequencing Conditions (SC).


Machine | AP | EC | SC
PC | 15 ms | 0.5 ms | 3.5 ms
Pi B | 39 ms | 6 ms | 540 ms


The average time for the consistency check on the PC was 39 ms and on the Raspberry Pi Model B was 849 ms. Note also that the performance of consistency checking depends on the performance of the underlying solver and the domain size of the environmental variables (since solvers are faster when the search domain is smaller). For example, in our implementation, the norms' time conditions were specified as (hours×100+minutes) and time intervals could be subsets of each other. Table 3 shows the SMT solver performance for constraints with 5, 10, 20, 50, 100, and 500 environmental variables. The overhead of the brics library for language subset and equivalence checks is around 7 ms on average. However, the projection algorithm is the bottleneck, since it computes the permutations of the information types that are needed for substitution in the regular expression. Due to this drawback, the framework limits the number of children of each attribute to 5.
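The cost of the permutation step can be estimated with a short count. This is a back-of-the-envelope derivation, assuming that every non-empty subset of the $n$ children contributes all of its orderings, as described for the projection; the symbol $A(n)$ is introduced here only for illustration.

```latex
A(n) = \sum_{k=1}^{n} \binom{n}{k}\, k! = \sum_{k=1}^{n} \frac{n!}{(n-k)!}

A(2) = 2 + 2 = 4, \qquad
A(5) = 5 + 20 + 60 + 120 + 120 = 325, \qquad
A(6) = 6 + 30 + 120 + 360 + 720 + 720 = 1956
```

Two children give the four alternatives of the substitution example, five children already give 325, and a sixth child raises the count to 1956, so a small cap on the number of children keeps the generated expressions manageable.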


Table 3: Performance of consistency checking for Environ- 
mental Variables 


Number of Variables | 5 | 10 | 20 | 50 | 100 | 500 
Implication time (ms) | 26 | 28 | 30 | 40 | 35 | 66
Equivalency time (ms) | 32 | 34 | 35 | 46 | 41 | 67 


6 CONCLUSION AND FUTURE WORKS 


Administering and managing users’ privacy is a major challenge in
the digital age. Privacy has a different meaning to different users de- 
pending on their personality, age, social status, cultural background, 
and many other factors. However, current privacy management sys- 
tems cannot address these privacy needs adequately since they are 
not designed based on the users’ privacy perspectives. Therefore, 
the lack of user-centric privacy management tools and algorithms 
limits users’ ability to have control over their data sharing activities 
and puts unaware users at risk of information disclosure. In order 
to overcome these limitations, the proposed framework provides 
a privacy formalism and verification engine to specify and model 
privacy from the user’s perspective. Moreover, as a proof of concept, 
a framework was implemented and tested based on the described 
formalism. In the proposed model, the contextual integrity theory 
has been customized to address the privacy needs of individual 
users. Further, the user-centric privacy framework is meant to be 
utilized in the new generation of smart devices and IoT, which 
compared to servers and general purpose computers, have lower 
memory and computational power. These limitations justify the 
use of regular expressions instead of Linear Temporal Logic (LTL) 
in our paper since empirical evidence [9] shows that the evaluation 
of the regular expressions has significantly less overhead compared 
to LTL. 

In future work, the current user interface will be eliminated and the user’s privacy norms will be generated automatically using text analysis, speech recognition, and AI algorithms that can infer the user’s privacy policies from the user’s relationships and information sharing behaviors.
ABSTRACT


Classic case pattern mining is a fundamental subject in data mining and big data science. The goal of the mining is to correctly find, from a given dataset, the patterns and their respective intrinsic frequentness. This paper examines two important yet misused instruments, the pattern frequentness measure "support" and the full enumeration pattern generation mode, which cause serious overfitting and thus deviate from the mining goal. A theoretical combined solution for the two critical issues is then proposed. This solution, plus the equilibrium condition introduced in this paper, forms a set of three fundamental rationality check criteria that every mining approach should observe. As such, the rationality of the mining theory and the reliability of the mining results would be substantially improved over previous work. Together, these promise a significant change towards more effective pattern mining.


CCS CONCEPTS 


• Information systems → Data mining; Evaluation of retrieval results.


KEYWORDS 


data mining, frequentness measure, overfitting, pattern frequency, 
pattern mining, probability anomaly, selective pattern generation, 
underfitting 


ACM Reference Format: 

Tongyuan Wang and Bipin C. Desai. 2019. On the Appropriate Pattern Fre- 
quentness Measure and Pattern Generation Mode - A Critical Review. In 
23rd International Database Engineering & Applications Symposium (IDEAS’19), 
June 10-12, 2019, Athens, Greece. ACM, New York, NY, USA, 15 pages. 
https://doi.org/10.1145/3331076.3331125 


1 INTRODUCTION 


Starting with item-set mining [1] [5], pattern mining has expanded into many current data mining research areas, for instance, medical and genomic data mining. Extensive research on frequent pattern mining has been reported to date. Most of the research focuses on algorithms and computational performance, including scalability and memory optimization [19]. Performance is certainly important in dealing with large datasets, but the primary yet less studied issue is the theoretical establishment of the mining mechanism and its reliability, including their definitions, measurements and test norms. Once the fundamental mining theory is established, one can attempt the design, implementation and testing of mining algorithms. This paper is a humble effort in this direction: it clarifies some basic concepts and issues embodied in two important instruments used in the mining, the pattern frequentness measure and the pattern generation mode, so that the rationality of the mining theory and the reliability of the mining results can be improved.

Frequent pattern mining, as its name implies, can be simply defined as "frequentness-based mining". It differs from other data mining techniques, such as classification or clustering, that use other characteristics of the datasets for the mining. Because frequentness is the only criterion, a proper measure of it is of fundamental significance. A more profound issue is which patterns a frequentness should be assigned to: for a given dataset, we normally do not know, a priori, the number of patterns or the makeup of each pattern. Conventional mining approaches use "support" to measure pattern frequentness, along with the full enumeration mode to generate patterns. This measure and this mode, however, are not justified under our review, which is based on established probability and statistics theories. The drawbacks and consequences of using the measure and the mode are detrimental to effective pattern mining. Our solution is a reformulated support combined with the selective pattern generation mode to get around the drawbacks.

The second section of this paper investigates and discusses the 
drawbacks of previous approaches. The third section discusses and 
proposes a solution for the shortcomings of the "support" measure. 
The fourth section reexamines the infeasibility of the conventional 
full enumeration pattern generation mode and identifies a selec- 
tive mode to replace it. The fifth section analyzes and justifies the 
effectiveness of our proposed solution. The last section gives our 
conclusion. 


2 RELATED WORK AND OPEN ISSUES 


This paper focuses on pattern mining over non-continuous data 
sources. However, the principles discussed herein could be applica- 
ble to continuous data sources as well. 


2.1 The problem and terminology


According to the application domains, pattern mining has developed from the early days' market-basket item-set mining to today's temporal pattern mining, spatial pattern mining, sequential pattern mining, health care data mining, genomic pattern mining, and so on. Nevertheless, the fundamental problem, finding frequent patterns, is common to all of these tasks. As important issues in pattern mining remain unsolved, we start our investigation from the market mining problem. Historically, it is the original mining model, and most subsequent mining subjects and techniques inherited their basic concepts and methods from item-set mining. Moreover, most readers are familiar with the market dataset, and it remains one of the most frequently mined data types in recent surveys [2] [3]. However, we will see that what is discussed in this paper is generic and not restricted to transactional (market) dataset mining.

Table 1: A Database (DBo)

TID | VID
T1 | V1, Ve, V7
T2 | V2, Va, V7, Vg
T3 | V2, V6
T4 | V1, Vo, Vg
T5 | V1, V2, Va, V4, V7, Vg
T6 | Vs
T7 | V} V;
T8 | V5
T9 | V1, V2
T10 | V1, V2, V3, Vg

Table 1 (DBo) is the dataset for the running example used in this paper. It can be seen as an abstraction of the market transactions typically presented in published item-set mining articles [1] [5] [19]. The DBo has u rows and two columns. Column TID represents the key attribute and VID represents an application domain Q of n distinct elements. Each row is a tuple, where $T_i$ ($i = 1, 2, \ldots, u$) is a tuple ID, and each cell of column VID contains a value V (or a set of values) of that domain. For example, in a market-basket problem, a TID could represent a transaction ID, and a value of VID, $V_j$ ($j = 1, 2, \ldots, n$), represents an "item" from the domain Q of merchandise. In particular, a combination of k distinct Vs is termed a pattern $Z_k = (V_i V_j \ldots V_s)$ of length k, or a k-itemset in the market-basket problem [1] [5]. In conventional mining approaches, the enumeration process of such combinations is called pattern generation. By convention, the number of occurrences of a pattern Z over the database is denoted $S_Z$ or $S(Z)$ and called the (absolute) support of Z. In some of the literature, this occurrence count is also regarded as the (absolute) frequency of Z and denoted $F(Z)$. The relative support is a ratio, denoted $s_Z$ or $s(Z)$ [1] [19], such that:

$$s_Z = s(Z) = \mathrm{count}(Z)/|\mathrm{DBo}| = F(Z)/u = S(Z)/u = S_Z/u, \qquad (2.1)$$

where u = |DBo| is the total number of tuples, i.e., the cardinality of DBo. For instance, from Table 1, we get $S(V_1 V_2) = 3$ and $s(V_1 V_2) = 3/10 = 0.3$.
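The two support measures of (2.1) can be computed in a few lines. The sketch below uses a small hypothetical ten-tuple database; it is not a transcription of Table 1, and was only chosen so that the pattern (V1 V2) again occurs in three tuples. The function names are likewise assumptions.

```python
def absolute_support(pattern, database):
    """S(Z): the number of tuples that contain every element of the pattern."""
    pattern = set(pattern)
    return sum(1 for items in database.values() if pattern <= items)


def relative_support(pattern, database):
    """s(Z) = S(Z) / u, where u is the number of tuples in the database."""
    return absolute_support(pattern, database) / len(database)


# Hypothetical database with u = 10 tuples (illustrative values only).
db = {
    "T1": {"V1", "V6", "V7"},
    "T2": {"V2", "V4", "V7"},
    "T3": {"V2", "V6"},
    "T4": {"V1", "V6"},
    "T5": {"V1", "V2", "V3", "V4"},
    "T6": {"V5"},
    "T7": {"V3", "V7"},
    "T8": {"V5"},
    "T9": {"V1", "V2"},
    "T10": {"V1", "V2", "V3"},
}

print(absolute_support({"V1", "V2"}, db))  # 3
print(relative_support({"V1", "V2"}, db))  # 0.3
```

Each element is counted at most once per tuple because tuples are sets, which reflects the uniqueness and atomicity assumed for items in this kind of dataset.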

Obviously, (2.1) comes from the classical frequency-based concept of probability, and $s_Z$ should be taken as the first link between probability theory and pattern mining. In particular, since $S_Z$ or $S(Z)$ corresponds to $F(Z)$ as an absolute frequency measure, $s_Z$ or $s(Z)$ should be taken as the frequentness measure.

In statistics terminology, the dataset DBo is a sample of the real-world application at hand. The cardinality u of DBo is the sample size, and a record (tuple) is a realized event of the sampling, hence a subset of Q [16]. In data mining language, we call each original tuple (event) an original pattern, or an original observation. A TID can be taken as a sample label or trial ID, and the column VID refers to the set of events [6]. Based on these notations, the fundamental pattern mining problem can be stated as follows:

Problem 2.1: Given a dataset DBo as shown in Table 1 involving the universe Q of n distinct elements of domain VID, output all patterns of the elements of any length such that the relative support $s_Z$ of a pattern Z satisfies $s_Z > s_{min}$, where $s_{min}$ is a user-predefined minimum support; such patterns are called qualified patterns.
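Problem 2.1, solved in the conventional full enumeration mode, can be sketched as follows. This is a naive illustration of the problem statement and not an efficient miner: it enumerates every non-empty combination of elements and keeps those whose relative support exceeds the threshold, using the strict comparison as the problem is printed here. The function name and the four-tuple sample database are hypothetical.

```python
from itertools import combinations


def qualified_patterns(database, s_min):
    """Enumerate all non-empty patterns and keep those with relative support > s_min."""
    items = sorted(set().union(*database))
    u = len(database)
    result = {}
    for k in range(1, len(items) + 1):
        for pattern in combinations(items, k):
            wanted = set(pattern)
            support = sum(1 for items_in_tuple in database if wanted <= items_in_tuple) / u
            if support > s_min:
                result[pattern] = support
    return result


# Hypothetical database of four tuples over the elements A, B and C.
db = [{"A", "B"}, {"A", "B", "C"}, {"A"}, {"C"}]
print(qualified_patterns(db, 0.4))
```

With three elements the loop visits all seven candidate patterns; in general it visits every non-empty subset of the domain, which is the source of the complexity concern for large domains.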

The above is a general introduction to the basic concepts of item-set mining. At this point it is important to note that, although the dataset of Table 1 was originally introduced to solve the market mining problem [1] [5], it can only be seen as an abstract dataset and cannot fully reflect, and thus fully solve, the market problem. The basic reasons are given below; we will further clarify this issue in the next sections.

1). From the literature summarized in the next subsection, the "item" presented in datasets such as Table 1 assumes the same concept as the "element" in set theory. That is, an item or element is unique and atomic (indivisible) in each data tuple. We call the uniqueness and indivisibility properties together the "classic data nature". This nature is reflected in the calculation of a single $S_Z$ shown above. Hereafter, if there is no ambiguity, we use the words "item" and "element" interchangeably.

2). Although each $V_i$ in Table 1 is assumed to represent an item (product), the table is abstract. In particular, in the very early itemset mining article [5], the mining problem is specified to deal with "a set of literals". Each element (item) is called a "literal" and coded with an ID, as can be inferred from that article. This coding convention is adopted in later research papers and especially in the dataset reservoir [47], wherein all the elements stored in every dataset are coded and each element is represented by a unique number. That is, the semantics of each element in every dataset is ignored, such that all elements are taken to be homogeneous, with only their numerical IDs left to distinguish them from each other.

3). The dataset is static.

With the above characteristics, we call a dataset such as Table 1 the "classic dataset".

Since the classic dataset is abstract and cannot exactly reflect the market problem, the so-called itemset mining over it may not realize its mining purpose adequately. On the other hand, this dataset can closely model some other mining problems. For instance, if each tuple of Table 1 represents a daily record of museum visits and each $V_i$ represents a visitor to that museum, then $V_i$ is naturally unique and atomic in a tuple. Considering these aspects together, we refer to the classic dataset (Table 1) together with Problem 2.1 as the "classic case mining model" or simply the "classic mining problem". Accordingly, the mining issues and their solutions discussed in the rest of this paper are generic and applicable to any mining problem of a nature similar to classic mining, and are not restricted to item-set mining.

The classic case model is the simplest and thus the fundamental mining model. Only after the mining theory and techniques for the simplest model have been well established can we properly proceed to more complex mining applications. This is the basic reason we reexamine classic case mining.

The classic mining problem is not too difficult to comprehend; the main issue for most data mining research is the computational complexity due to the potential power set ($2^n$) of possible patterns over the n-element domain Q. Note that in this paper we do not take the empty set (Ø) as a pattern, and leave its frequency F(Ø) undefined (because every tuple could produce Ø, taking Ø as a pattern would lead to a couple of problems in pattern generation and frequency determination). Hence the largest number of possible patterns is $2^n - 1$. The power set complexity demands not only a great amount of computation time but also a large memory space to store the candidate patterns and other information. Whereas the computational complexity issue has been extensively studied, the more fundamental work on the appropriateness of the frequentness measure (2.1) and the pattern generation mode seems to have been neglected. In the next subsection we briefly review some of the work on pattern mining.


2.2 Related work 


In the classic case mining, earlier mining approaches mainly fo- 
cused on how to efficiently obtain all possible qualified patterns 
from a given dataset. Examples of these proposals are the Apriori 
algorithm [5] [19] and its variations and extensions, such as the 
“incremental mining" [34], the “dynamical item-set counting" [35], 
the “parallel and distributed mining" [38], the “hash-based" [39] and 
the "partitioning" [39] algorithms. The Apriori based approaches 
feature pruning infrequent item-set as early as possible to achieve 
computation efficiency by use of the downward closure property 
which states that the super patterns of less frequent patterns could 
not be frequent [5] [19]. Subsequently, an approach called frequent 
pattern growth (FP-growth), originally developed in [7], tries to 
achieve efficiency by avoiding candidate pattern generation so as 
to reduce memory space requirement and IO cost. The avoidance 
is achieved by a construction of frequent pattern tree, or FP-tree, 
which is a prefix tree and acts as a compression of the original 
database, such that a data tuple is embodied in a branch of the 
tree. Then pattern mining from the original database is converted 
to mining from the FP-tree. This approach has also been further 
developed with many extensions and variations, such as the "hyper 
structure mining" approach [21], the "bottom up and top down" 
tree building approach [41] [32], the array based data structure to 
implement the prefix tree [42], and the like. There is also a pro- 
posal which avoids multiple IOs and reduces the candidate pattern 
generations by statistical estimation based on database scanning 
[8]. Other approaches try to improve mining efficiency by differ- 
ent pattern search strategies, including “breadth first" and “depth 
first" search methods. In a breadth first search the patterns are 
generated and examined from each record (T_i: {V_j}) in horizontal 
data format [19]. In a depth first search, the original dataset DBo is 
transformed into a "vertical" dataset (V_j: {T_i}) before patterns and 
their frequencies are determined [9] [10]. 

The above approaches and many others not listed have a com- 
mon feature that all possible qualified patterns are produced and 
collected in the mining result set and delivered to the user. The 
result set is usually very large for a fairly large dataset, leading 
to the problem of interpreting the result. Researchers tackled this 
issue by presenting the mining result set in a reduced form. 

The "constrained" pattern mining [20] [21] [22] reduces the min- 
ing result set size by user constraints. For instance, a user may 
want to mine patterns with V4 and V2 only from Table 1. Another 
school of reduction approaches is the "concise" ("condensed" [4], 
or "compressed" [30] [31]) representation of the patterns, where a 
small subset of the frequent patterns is used to represent the whole 
mining result set. For example, the "free sets" [24] or "generators" 
[25] are concise sets to represent the whole result set in an appli- 
cation. Similarly, other concise sets and mining approaches, e.g., 
"disjunction-free" [26], or “non-derivable" sets [23], the “succinct 
summarization" [29] and the “krimp" [33] approaches, have been 
proposed, while the "closed" [11] [13] and the "maximal" [14] [15] 
[?] approaches have attracted more attention. A pattern is closed 
if none of its proper super-patterns has the same frequency [13] 
[19]. A pattern is maximal if none of its proper super-patterns is 
frequent against a given s_min [13] [19]. The "closed set" representation is a 
lossless compression of the results set, in the sense that all patterns 
and their supports can be derived from the closed set; while the 
"maximal" expression is a lossy compression [19]. Similar to a lossy 
compression, there are approximation approaches to represent the 
mining results, for instance, the “top-k frequent patterns" [37] [48], 
“top-k most frequent closed" [27] [40], and the “pattern profile" [28] 
approaches. 
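
To make the closed and maximal definitions above concrete, the following Python sketch applies them to a small hypothetical dataset of three tuples. The dataset, the function names and the absolute threshold `min_count` (standing in for s_min) are illustrative assumptions, and the brute-force enumeration is for exposition only; it is not one of the cited algorithms.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy dataset: each tuple is a set of element IDs.
DATASET = [{"A", "B"}, {"B", "C"}, {"A", "B", "C"}]


def support_counts(dataset):
    """Count, for every non-empty pattern, the tuples that contain it."""
    counts = Counter()
    for row in dataset:
        items = sorted(row)
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return counts


def closed_patterns(counts):
    """Patterns with no proper super-pattern of the same frequency."""
    return {
        z for z, c in counts.items()
        if not any(z < other and counts[other] == c for other in counts)
    }


def maximal_patterns(counts, min_count):
    """Frequent patterns with no frequent proper super-pattern."""
    frequent = {z for z, c in counts.items() if c >= min_count}
    return {z for z in frequent if not any(z < other for other in frequent)}


counts = support_counts(DATASET)
closed = closed_patterns(counts)
maximal = maximal_patterns(counts, min_count=2)

print(sorted("".join(sorted(z)) for z in closed))
print(sorted("".join(sorted(z)) for z in maximal))
```

On this toy data the closed set is {B, AB, BC, ABC} and the maximal set for `min_count = 2` is {AB, BC}. The maximal set alone cannot tell that B occurs three times, which is the sense in which it is a lossy representation.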

Another large group of researchers proposes using "interestingness 
measures" [54][55][58][57][62], alternatively called "quality 
measures" [63], to reduce the mining result set. The interestingness 
measure school is complex and originated from dozens of 
reduction proposals, not only for association rules but also for 
classification rule and summary mining applications [57]. Although 
there is no formal or generally accepted definition, interestingness is 
taken to be determined by 9 criteria, such as conciseness, coverage, 
utility, and the like. These 9 criteria can be further categorized into 
three classes: objective, subjective and semantic based [57]. In 
pattern and association rule mining in particular, there are in total 
38 objective interestingness measures, including the original 
"support" and "confidence" measures, as summarized in [57] from the 
literature [56][59][60][61], not to mention the subjective and 
semantic based measures. It is thus not a trivial job to find a proper 
interestingness measure for an application [56]. 

Similarly, there are proposals to use weighted measures [64] [65], 
for instance, by assigning different weights to the s_min for itemsets 
of different importance, in pursuit of better mining results. 

To date a multitude of research papers on pattern mining have 
been published, and only a small portion of them can be cited here 
due to space constraints. Since our focus is on classic case mining, 
a full understanding of the problem and its history requires looking 
into the early concepts and approaches. Many later approaches 
that target non-classic datasets, e.g., stream data [71] [72], 
uncertain data [68][69] or data with different weights [66] [67], 
are beyond our focus and are not discussed specifically here. Our 
discussion will nevertheless have an important impact on them, 
since these approaches inherited the basic concepts and methods 
of the classic case mining approaches and thus also their drawbacks, 
which stem from s(Z) and the full enumeration pattern generation 
mode. After the theoretical foundation of classic case mining has 
been reshaped and solidified, the remaining mining approaches 
will have to be rebuilt. 

From the literature summarized above, we see that previous 
researchers have noticed the problem of the excessive number of result 
patterns in classic case mining and have tried many ways to remedy 
it. However, in our examination the problem is still not 
essentially solved. The basic reason is that previous remedies 
mainly squeeze the size of the result pattern set by using different 
exogenous measures or constraints, and do not address the 
ill-definition of the measure s(Z) or the infeasibility of the full 
enumeration pattern generation mode. 

For instance, the interestingness measure or the weighted 
measure y_Z over a pattern Z assumed by different researchers can be 
generally expressed as y_Z = f(s_Z, X), where X is a vector of the 
exogenous measures, modifiers or constraints proposed by a 
researcher. Whether or not X is properly proposed, since the support 
measure s_Z is not well defined (as will be seen in the next section), 
y_Z can never become sound and thus cannot function well. 

Even though the concise or representative approaches are 
assumed to overcome the drawbacks of the full enumeration pattern 
generation mode so as to reduce the result pattern set size, they 
are still based on that mode. Firstly, for a given pattern Z, they 
produce the same s_Z as other approaches. Secondly, the downward 
closure property is still kept in these approaches (this property is 
only a phenomenon of the full enumeration mode, as will be seen 
later). Thirdly, these approaches cannot establish that only 
patterns within the concise result set are meaningful while those 
outside it are not. Additionally, "closed", "maximal" and 
similar concepts are themselves exogenous measures. 
These exogenous measures are created by the researchers (miners) 
rather than by the users, and carry no clear meaning for the 
users. For instance, a user may just want the miner to 
deliver a set of reliable patterns, without knowing what 
exactly closed or maximal patterns mean to him or her. 

We do not object in general to the use of exogenous measures or 
constraints in mining. Our point is that only after the two basic 
instruments, the frequentness measure and the pattern generation 
mode, have been well established can supplemental measures 
be properly taken in and function effectively. This 
is the fundamental reason we need to restudy the two 
instruments, as presented in this paper. We will see in the rest 
of this paper that by rationalizing s(Z) and using only the selective 
generation mode, we achieve a great reduction in the 
size of the result pattern set. Our approach is thus not a concise or 
representative approach in the conventional sense, and the reduction 
from our solution is fundamental, natural and unconditional. 
In contrast, previous reduction approaches are supplemental and 
conditional: they realize a reduction by imposing constraints, 
extra measures, etc. 

Since previously attempted remedies did not satisfactorily solve 
the basic issues of the support and the generation mode, to save 
space we do not discuss those remedies again in the next subsection, 
but focus on the drawbacks present in previous work. 


2.3 The drawbacks of the previous approaches 


As investigated in [50], from an observational point of view, the main 
drawbacks of conventional approaches are: 

1) An overwhelming number of resulting patterns, many of them 
meaningless and some even "counter intuitive" [43]. 

2) The bias for generated patterns against the originally observed 
ones. For instance, from Table 1, S(V1) = 5 while S(V5) = 2, meaning 
V1 is more frequent than V5. However, V5 is an originally observed 
pattern, whereas V1 is not certainly a pattern yet, since it is 
"generated" and thus only a "possible" pattern. Furthermore, we will 
see in the next section that even by generation, S(V1) cannot be 
sure to reach 5. 

3) The bias for shorter patterns against longer ones. This is 
formalized in the so-called "downward closure" property [19], as 
mentioned before. Ultimately, the individual elements will be the 
most frequent patterns derived from the dataset. But in the real world, 
it is common that compounds (patterns) are seen more frequently 
than single elements. For instance, in a copper mine, pure copper 
(Cu) is found much less frequently than its compounds, e.g., copper 
oxide (CuO). 

From a deeper analysis, the above phenomena or drawbacks 
come from the following main issues of previous work: 

1) The improperly defined frequentness measure, the conven- 
tional "support". We discuss this and the related solution in the 
next section. 

2) A lack of formal definition on what a pattern is, and the infeasi- 
ble full enumeration pattern generation mode used conventionally. 

In a sense, most of the above issues can be traced to the lack of a 
proper definition, and thus of a good understanding, of the pattern, 
even though so many papers about pattern mining have been published 
so far. We are not yet ready to give a formal definition here either, 
but at least we can intuitively understand that: 

A frequent pattern should be a configuration of the same elements 
appearing significantly in a dataset. 

Configuration means combinations, but not only that. Configu- 
ration implies stability, structures, and more importantly internal 
inherent connections, known or unknown, among the elements of 
a pattern. In pure frequentness based pattern mining, we do not 
have to consider the structure issue, and we do not even have to 
know the reasons for internal connection before or during the min- 
ing. However, the nature of such connections should be addressed. 
Indeed, in many cases, revealing the reasons for such internal 
connections could be the main interest and purpose of undertaking 
pattern mining. When the nature of the internal connection among the 
elements of a pattern is addressed, it is easy to see the 
inappropriateness of the full enumeration mode. Further discussion 
about this mode is presented in Section 4. 


3 THE IMPROPER DEFINITION OF THE 
"SUPPORT" AND ITS SOLUTION 


As stated before, pattern mining is a probability and statistics based 
technology, and some even take pattern mining to belong to 
the domain of statistics [18]. In particular, the frequentness measure 
s_Z is the first link between pattern mining and probability 
theory. However, our investigation finds that a fundamental problem of 
pattern mining comes from s_Z as defined in (2.1). It can be 
easily observed that the cumulative probability (the sum of the s(Z)s) 
from a mining project as presented in previous work is in general 
much larger than 1, which seriously violates the fundamental 
concept of probability. We call this issue a "probability anomaly". 
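
The anomaly can be reproduced on a very small scale. The sketch below is illustrative only: it uses the three-tuple dataset that the paper introduces later as Table 2, generates patterns by full enumeration, and sums the conventional supports s_Z = S_Z/u. The function and variable names are assumptions made for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# The three tuples of the mini dataset (Table 2 of the paper).
DATASET = [{"A", "B"}, {"B", "C"}, {"A", "B", "C"}]


def full_enumeration_counts(dataset):
    """S(Z): number of tuples containing each non-empty pattern Z."""
    counts = Counter()
    for row in dataset:
        items = sorted(row)
        for size in range(1, len(items) + 1):
            for pattern in combinations(items, size):
                counts["".join(pattern)] += 1
    return counts


counts = full_enumeration_counts(DATASET)
u = len(DATASET)  # cardinality of the original dataset

# Conventional relative support s_Z = S_Z / u for every generated pattern.
supports = {z: Fraction(c, u) for z, c in counts.items()}
total = sum(supports.values())

print(dict(counts))
print(total)
```

Even for three tuples over three elements, the supports sum to 13/3, well above 1.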

A concrete example of the anomaly can be found in Table 4 
(in Section 5), based on the dataset of Table 1, where Σ s_Z > 11 > 1. 
In real applications this probability anomaly is much more severe, 
and the reason for it will become clear in the next sections. Such a 
probability anomaly can only be traced to the improper definition of 
s_Z (2.1). Indeed, we can formally prove (though the proof cannot 
be presented here due to space constraints) the equivalence 
of s_Z in both the classic frequency based probability theory and 
the multivariate probability theory [16][17], but neither of them 
justifies the use of s_Z as the proper frequentness measure in pattern 
mining. For instance, s_Z is equivalent to the absolute probability 
(alternatively called the unconditional probability) P(Z) in classic 
probability theory. Such P(Z)s, for instance P(A) and P(B), cannot 
be used directly to compare the relative frequentness of A 
and B. Instead, P(Z)s are used in the study of conditional probability 
and dependence. For example, 


P(B|A) = P(AB)/P(A), (3.1) 
P(D|C) = P(CD)/P(C), 


where P(A) and P(C) are the absolute probabilities of A and C 
respectively. A numerical comparison between P(B|A) and P(D|C), or 
between P(AB) and P(CD), carries no semantic meaning, 
since they refer to different conditions. However, in previous pattern 
mining work, the s_Z values are compared directly. 

From the above we can see an extended issue: the "confidence" 
measure [1], [5], [19] used in association rule mining by 
conventional approaches is an application of dependence 
analysis. For instance, in 

conf(A → B) = s(AB)/s(A), 

the right hand side is exactly the same as the right hand side of (3.1). 
That is, conf(A → B) = P(B|A). 

However, it is not appropriate to compare the magnitudes of 
conf(A → B) and conf(C → D), for instance, simply because 
their counterparts P(B|A) and P(D|C) are not directly comparable, 
as just addressed above. This is an example of how the 
reestablishment of the pattern mining theory will unavoidably impact 
other related mining work. The "confidence" issue is not 
discussed further here, since it is beyond the focus of this paper 
and because of space limitations. 
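
The identity conf(A → B) = P(B|A) can be checked numerically. The following sketch is an illustration under stated assumptions: the three-tuple dataset is the paper's Table 2, and `support`, `confidence` and `conditional_probability` are hypothetical helper names, with P(B|A) estimated by counting only the tuples that contain A.

```python
from fractions import Fraction

# The three tuples of the mini dataset (Table 2 of the paper).
DATASET = [{"A", "B"}, {"B", "C"}, {"A", "B", "C"}]


def support(pattern, dataset):
    """Conventional relative support: fraction of tuples containing the pattern."""
    hits = sum(1 for row in dataset if set(pattern) <= row)
    return Fraction(hits, len(dataset))


def confidence(antecedent, consequent, dataset):
    """conf(A -> B) = s(AB) / s(A); assumes the antecedent occurs at least once."""
    joint = set(antecedent) | set(consequent)
    return support(joint, dataset) / support(antecedent, dataset)


def conditional_probability(antecedent, consequent, dataset):
    """P(B|A), estimated over the tuples that contain A only."""
    with_a = [row for row in dataset if set(antecedent) <= row]
    with_both = [row for row in with_a if set(consequent) <= row]
    return Fraction(len(with_both), len(with_a))


conf_ab = confidence("A", "B", DATASET)
conf_bc = confidence("B", "C", DATASET)

print(conf_ab, conditional_probability("A", "B", DATASET))
print(conf_bc, conditional_probability("B", "C", DATASET))
```

The two values agree for each rule, but conf(A → B) is conditioned on the two tuples containing A while conf(B → C) is conditioned on all three tuples, which is why the paper regards a direct comparison of their magnitudes as inappropriate.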

A more understandable yet not well noticed point concerns the 
definition s_Z = S_Z/u, where u is the cardinality of the original 
dataset DBo. The collection of the result patterns Z should instead be 
seen as a new dataset derived from DBo. We name this dataset DBv, 
and suppose each pattern Z to be stored in a separate tuple of 
DBv. Then the cumulative frequency (occurrence) of all the result 
patterns equals the cardinality of DBv. That is, s_Z should be 
reformulated as: 


s'_Z = S_Z/w, (3.2) 


where S_Z represents the generated "occurrences" or "frequencies" 
of Z, F(Z), while w is the cardinality of DBv. That is, w = |DBv| = 
Σ S_Z = Σ F(Z). 

With the above reformulation, Σ s'_Z automatically becomes 
1, and the probability anomaly is fully eliminated. Meanwhile, we 
can easily notice that s'_Z < s_Z will hold in general, while for most 
Zs, s'_Z ≪ s_Z will be true. This will become clearer in Section 5. 
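
A minimal sketch of reformulation (3.2) follows. It only illustrates the change of denominator from u to w: for simplicity the occurrence counts S_Z are still produced by full enumeration over the paper's three-tuple Table 2, which is an assumption of this example and not the selective generation mode proposed later. All names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# The three tuples of the mini dataset (Table 2 of the paper).
DATASET = [{"A", "B"}, {"B", "C"}, {"A", "B", "C"}]


def occurrence_counts(dataset):
    """S_Z for every non-empty pattern, generated here by full enumeration."""
    counts = Counter()
    for row in dataset:
        items = sorted(row)
        for size in range(1, len(items) + 1):
            for pattern in combinations(items, size):
                counts["".join(pattern)] += 1
    return counts


counts = occurrence_counts(DATASET)
u = len(DATASET)           # cardinality of the original dataset DBo
w = sum(counts.values())   # cardinality of the derived pattern dataset DBv

conventional = {z: Fraction(c, u) for z, c in counts.items()}  # s_Z  = S_Z / u
reformulated = {z: Fraction(c, w) for z, c in counts.items()}  # s'_Z = S_Z / w

print(sum(conventional.values()), sum(reformulated.values()))
print(conventional["B"], reformulated["B"])
```

Here u = 3 and w = 13, so the reformulated supports sum to exactly 1, and every s'_Z is smaller than the corresponding s_Z (for B, 3/13 against 1).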

There might be arguments against the above 
observations and the reformulation. The most likely one is that 
the "supports" s(Z)s should not be directly summed together, 
since, for a given pair of patterns A and B, they may be 
conjunct, such that s(A) and s(B) cannot be directly added. We 
appreciate that this argument implies an acceptance of the use of 
probability theory to look into the problem. Moreover, 
in the next section we will clear up this conjunction 
issue so that all s(Z)s become directly additive. 

Another argument might be: why should the sum of the supports 
be normalized to 1? We can answer this from different angles, 
but a direct response is that this argument implies a refusal to 
view the s(Z)s through probability theory. If so, one should give 
the reason, and specify on what theory the support measure 
is established. In fact, when one takes X as a frequent 
pattern with s(X) > 20% in an application, one is indeed comparing 
that 20% with the reference number 1. Otherwise such a 20% could 
be trivial compared with a reference number much larger than 1. 

Additionally, notice that the absolute support S_Z is no better a 
choice than the relative s_Z. 

The above reveals that the use of s(Z) is a major cause of the 
excessive number of frequent patterns in a mining application, the 
first drawback of previous approaches (subsection 2.3). This is 
because most s(Z)s are larger than their real frequentness, as implied 
in the previous paragraphs. Theoretically, such an inflated valuation 
of pattern frequentness is referred to as overfitting. The formal 
concept and measurement of "overfitting" originate from the 
theory of numerical statistical modeling, which however cannot 
be directly adopted in pattern mining. A simple understanding here 
is that overfitting means an overevaluation of a pattern's 
frequentness, such that a spurious pattern is falsely taken to be a 
significant one. Conversely, falsely taking a truly frequent pattern 
to be infrequent, due to an underevaluation of its frequentness, is 
termed underfitting. In our study, the overfitting problem is 
widely present in previous work, as will be seen further in the 
following sections. 

The overfitting or underfitting problem is important since it 
determines the reliability of the mined results. Reliability is a widely 
used and discussed criterion in the data mining community. Article 
[49] is an example wherein the problem of enhancing data mining 
reliability is addressed. However, a formal and concise definition of 
data mining reliability and its measure is not readily available. In 
general, we take the concepts used in classic statistical tests 
as a reference to express mining reliability. These include 
the stability of the mined results against changes in data size or data 
source, and more importantly the degree of closeness of the 
mined results to the real values or structures embodied in the real 
world. For an unknown world, this closeness can only depend on 
the soundness of the mining technology. The minimum requirement 
of soundness should be the compliance of mining results with 
common sense or already known facts; a higher requirement should 
be the conformity of the mining principles with other related 
established theories. 

The above analysis reveals that the conventional "support" is 
inconsistent with the basic concepts of established probability 
theory and thus needs to be reformulated so that the supports sum 
to 1. What is emphasized here is the importance of 
establishing a rational underlying measurement system for 
mining, since no science or technology can be recognized as 
well-established without a well-established measurement system. 
For instance, the first part of a typical physics textbook [70] is 
on its measurement system. 


4 THE APPROPRIATE PATTERN 
GENERATION MODE, SELECTIVE VS. FULL 
ENUMERATION 


The previous section explains why s(Z) is not the proper pattern 
frequentness measure and should be replaced with s’(Z), such that 
the probability anomaly will be automatically eliminated and the 
overfitting be greatly reduced. However, s’(Z) alone could not fully 
correct the overfitting or underfitting without a proper pattern 
generation mode. In the following, we examine the problems related 
to the full enumeration mode used by the conventional mining 
approaches and propose a selective generation mode to replace it. 

To proceed, we first consider why pattern generation is 
performed. Assume V2, V4, V7 and V8 represent egg, coffee, milk, and 
flour respectively in a market transaction T2 of Table 1, and that the 
customer who made this transaction wanted to combine 
coffee with milk as one pattern, and to use egg with flour to make cake 
as another pattern. Then T2 truly contains at least two patterns. 
The purpose and significance of pattern generation is then to 
recover the patterns implied in a tuple. Based on this interpretation, 
pattern generation is justifiable. However, full enumeration 
generation over every tuple is not justifiable. This is because there 
are three possible cases when a tuple is collected: 

a). The entire tuple is a single pattern. 

b). The elements came together by coincidence, i.e., the 
tuple contains no real pattern but a random walk. 

c). The tuple contains more than one pattern. 

We see that only in the third case is pattern generation 
needed. Note that, as in most previous work, for simplicity 
this paper ignores the second case and assumes that at least one 
pattern is produced from each tuple. What must be addressed 
here is that, even in the third case, the full enumeration generation 
mode is still not generally applicable. To see this, we analyze why 
this mode is used by previous mining approaches. There might be two 
causes: 

The first cause could be the confusion between pattern formation 
in the real world and pattern generation from the abstracted classic 
dataset. Mining research started from the market problem, and 
in a real market transaction an item is usually plural in 
quantity, divisible, or both. For instance, the item egg would 
represent a box (or boxes) of 6 or 12 eggs, each of which can be 
used for different purposes and at different times. Similarly, the 
item milk can be divided and used for different purposes and at 
different times. The full enumeration mode could then be possible. 
However, in classic case mining, the dataset is static and the 
plurality and divisibility of the items are lost, leaving each item in 
a tuple unique and atomic (the classic data nature). The static and 
classic natures make the full enumeration pattern generation mode 
impossible, since they disallow the reuse of any item to generate 
different patterns within the same tuple. This observation also 
reflects the nature of mining: mining means revealing and identifying 
whatever patterns exist at mining time only, and is not concerned with 
when such patterns formed or how they may be reshaped later. In the 
above example, after V7 has been used with V4 to form a pattern, V7 is 
consumed and cannot be used again to generate other pattern(s) 
from that tuple in the same mining. 

Notice that the classic data nature is reflected in the calculation 
of tuple length and pattern length. We can also see the reflection in 
the calculation of a single S(Z) in conventional approaches since 
S(Z) is incremented by 1 only for each occurrence of Z in a tuple 
(this may be easier to see when Z is a length-1 pattern, a single 
item), but collectively the classic data nature is violated since an 
item can be found in more than one pattern generated from a tuple. 

As we all know, for any theory, once its initial model, including the 
concerned concepts, assumptions, hypotheses and so on, has been 
established, the rest of the theory must maintain those 
establishments; otherwise the theory will fall into self-contradiction 
and can never become sound. The issues addressed above exhibit the 
contradiction and unsoundness of the previous mining approaches. 

Table 2 

TID | VID 
T1 | A, B 
T2 | B, C 
T3 | A, B, C 

At this point, one may ask why not consider more properties of the 
items in the mining model, so that the issues discussed above are 
overcome and we achieve real world mining. The answer is simple: 
doing so will change the mining problem. That is, the mining methods 
and mining results will be substantially changed from 
the conventional ones, and at the same time the mining complexity will 
be greatly increased to an unmanageable level. To see this clearly 
and simply, we use below a mini classic dataset of three 
elements and three tuples (Table 2), and add only the plurality factor 
to it to form a new dataset (Table 3), to demonstrate the 
changes and the mining complexity. 

Table 3 

TID | VID 
T1 | A(2), B(5) 
T2 | B(1), C(7) 
T3 | A(3), B(3), C(5) 

(1) The change in the design of the data structure and/or the 
database schema to hold the added properties, as seen from Table 2 
to Table 3, where the parenthesized number in Table 3 is the 
quantity (units) of the related item. 

(2) The change in the definition and calculation of S(Z) and s_Z. 
Conventionally, from Table 2, S(B) = 3 and S(C) = 2, meaning B is more 
frequent than C, but from Table 3 the conclusion is reversed, since 
S(B) = 9 and S(C) = 12. This is a natural outcome: once quantity is 
taken into account, quantity matters. 

(3) A more serious change is in the pattern expression (formula) 
and the great increase in the number of patterns. From Table 3, we 
will have patterns like A2B3 or B3C4, or A_iB_jC_k in general, where 
the subscripts are integers each representing the quantity of the 
respective item within a pattern. We call such a pattern a "complex 
pattern" or "general pattern". Note that in classic case mining only 
simple patterns such as AB or ABC are concerned, because 
the quantity factor is ignored there. To the best of our knowledge, no 
article has touched on the concept of complex patterns in itemset 
mining research. Obviously, complex patterns are much more 
common in the real world than simple patterns, 
and the mining complexity of complex patterns will certainly 
be unmanageable by any mining approach proposed so far. 
This is because the number of possible complex patterns formed 
from a fairly large number of items will be astronomical, if not 
infinite. Notice that the complexity of the power set in classic case 
mining is already hard to handle. 
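
Changes (2) and (3) above can be checked with a short sketch. The quantities follow Table 3, with B's quantity in the first tuple taken as 5 so that S(B) = 9 as stated in the text. The counting rule for complex patterns (any quantity from 0 up to the available units per item, excluding the all-zero choice) is an assumption made for this illustration only; the paper does not give a formal definition. All function names are hypothetical.

```python
from math import prod

# Table 2: every item has an implicit quantity of 1.
TABLE_2 = {
    "T1": {"A": 1, "B": 1},
    "T2": {"B": 1, "C": 1},
    "T3": {"A": 1, "B": 1, "C": 1},
}

# Table 3: the same tuples with the plurality factor added.
TABLE_3 = {
    "T1": {"A": 2, "B": 5},
    "T2": {"B": 1, "C": 7},
    "T3": {"A": 3, "B": 3, "C": 5},
}


def item_count(table, item):
    """S(item): total units of the item over all tuples."""
    return sum(row.get(item, 0) for row in table.values())


def simple_pattern_count(row):
    """Non-empty subsets of the items in one tuple (classic case)."""
    return 2 ** len(row) - 1


def complex_pattern_count(row):
    """Non-empty quantity vectors: 0..q units per item, minus the all-zero one.

    This counting rule is an illustrative assumption, not the paper's definition.
    """
    return prod(q + 1 for q in row.values()) - 1


print(item_count(TABLE_2, "B"), item_count(TABLE_2, "C"))
print(item_count(TABLE_3, "B"), item_count(TABLE_3, "C"))
print(simple_pattern_count(TABLE_3["T3"]), complex_pattern_count(TABLE_3["T3"]))
```

Under this assumed rule, the single tuple T3 already admits 95 candidate complex patterns against 7 simple ones, which gives a sense of how quickly the pattern space grows once quantity is modeled.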

Readers will by now see the complications, especially if 
divisibility and other factors were also taken into the mining 
model. Including divisibility in the model will not only 
change the problem again, but also pose another difficulty to 
researchers: what rules should govern the division of different 
items, and how should those rules be set up? 

The above reflects what we stated before: we are still far away 
from the goal of real world pattern mining, since we have not yet 
properly solved even the simplest classic case mining, even though 
so many mining research papers have been published. 

The second cause of the use of the full enumeration pattern 
generation mode by conventional approaches could be the 
neglect of the difference between "possible" patterns and 
"realized" or "deliverable" patterns. Strong evidence of this is that 
most early approaches summarized in Section 2 aim 
to output "all possible patterns" with s_Z > s_min (notice s_min can be 
zero), and the words "realizable patterns" or the like are seldom 
seen in those early papers, nor in the later concise or 
representative approaches. That is, even if previous researchers 
had noticed the problem of too many result patterns, they did not 
well understand the real reason behind it. In particular, they 
did not consciously notice that the patterns to be delivered to the 
user must first be realizable from the mining, and thus could not 
propose an effective approach to solve the problem. Obviously, 
the realizable patterns will be much fewer than the possible ones. 
This can be seen again from the example tuple {V2, V4, V7, V8}, from 
which V4, V2V4, V4V7 and V4V8 are possible patterns, but at most 
one of them can be realized, because of the uniqueness of V4 as 
mentioned before. 

The realization concept is another key to seeing the 
inappropriateness of the definition and calculation of the support 
S(Z). If the original idea of S(Z) is to mean how many tuples of a 
dataset support a given pattern Z, it is fine for a single Z but not 
collectively. In the above example, if V4V7 is taken as supported by 
the given tuple, then V4V8 and the like cannot be. In general, we are 
unable to determine the S(Z)s before a sophisticated theory has been 
established, since we do not know how many and which patterns can be 
supported by a particular tuple. Conventional approaches, however, 
take this as an easy job by assuming that each tuple supports all 
the possible patterns from the elements it holds, as implied by the 
full enumeration mode. 

The analysis of the above two causes should help us 
understand why the full enumeration pattern generation mode is not 
feasible in classic case mining. Notice additionally that even in real 
life the full enumeration mode cannot be arbitrarily assumed, since 
real pattern formation is constrained by attributes. For instance, 
forming coffee and soap into a pattern does not make much sense. 
However, since the classic dataset is stripped of semantics, such 
constraints are ignored here, as in most previous research. We now 
focus on the feasible pattern generation mode and its properties over 
classic datasets. 
Since patterns are generated from each tuple, we start the discussion 
with a tuple of Table 1 that holds the elements V1, V4, and 
V7. Theoretically 7 (= 2^3 − 1) "possible" patterns can be generated 
from it, but practically at most three out of the seven patterns can 
be "realized" and delivered in a particular mining operation. This 
is because, in an application, only one of the following optional 
pattern sets can be realized (delivered) from the tuple: 

Example 4.1: The pattern set delivery options of the given 3 
elements: 

a) {V1V4V7};

b) {V1, V4, V7};

c) {V1, V4V7};

d) {V4, V1V7};

e) {V7, V1V4}.

Note that the "delivery" mentioned above and hereafter refers 
to a full set delivery without a consideration of the effect of smin 
unless otherwise specified. 

From the above option list we can see that the first two pattern
sets are trivial: either a) all elements assemble together as a
single pattern, or b) each element stands separately as a pattern.
For any nontrivial pattern set from c) to e) to be output and
delivered in an application, the number of patterns is only 2, less
than the number of elements. Conventional mining approaches, however,
take all 7 patterns as realizable and deliverable, since all 7
patterns are output and the frequency of each of them is incremented
by 1 in the result pattern set.
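
The contrast between the two generation modes for this tuple can be checked with a short script. This is only an illustrative sketch, not the authors' implementation: the helper names `full_enumeration` and `set_partitions` are hypothetical, and the tuple is assumed to hold the three distinct elements V1, V4, and V7.

```python
from itertools import combinations


def full_enumeration(elements):
    """All non-empty subsets of the tuple (the conventional mode)."""
    return [frozenset(c)
            for r in range(1, len(elements) + 1)
            for c in combinations(elements, r)]


def set_partitions(elements):
    """Yield every partition of the element list as a list of blocks."""
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or let it stand alone as a new block.
        yield [[first]] + partition


tuple_elements = ["V1", "V4", "V7"]   # assumed tuple content
subsets = full_enumeration(tuple_elements)
options = list(set_partitions(tuple_elements))
# Nontrivial options: neither one single block nor all singletons.
nontrivial = [p for p in options if 1 < len(p) < len(tuple_elements)]

print(len(subsets), "patterns under full enumeration")
print(len(options), "delivery options (partitions)")
for p in nontrivial:
    print("nontrivial option:", p)
```

Each delivery option is one partition of the tuple, so any single option delivers far fewer patterns than the full set of subsets.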

The above observation can be extended to a set of any number 
of elements, and we have the following: 


THEOREM 4.1 (PATTERN SET SIZE THEOREM). The number of non- 
trivial patterns to be generated and delivered from a given data tuple 
in a particular application is less than linear to the tuple size. 


Proof. For a given set of k (k > 1) distinct elements, generating
patterns from it is equivalent to partitioning the elements into
different groups, so each element can be in one group only. The
largest possible number of groups is then obviously k, wherein each
element stands alone as a group (a pattern). However, such a
partition and the resulting pattern set are trivial. Therefore, in
any nontrivial output set the number of result patterns can only be
less than k. □


Note importantly that, in the above proof, the pattern generation
can indeed be seen as a partition of the original element set.
Formally, a "partition" $\{H_j\}$ of a given nonempty universe
$\Omega$ means that [52] [53]:

$\bigcup_j H_j = \Omega$, and $H_j \cap H_k = \emptyset$,

where $|\{H_j\}| = m > 0$; $1 \le j, k \le m$, and $j \ne k$; each
$H_j$ is nonempty and called a "part" or "block" of the universe, or
a "hypothesis" of the partition.

We define the partition based generation as the "selective pattern
generation mode" (or in short, "selective mode"), meaning that an
element may or may not be drawn to form a pattern with other
element(s) of the same tuple, depending on which partition is chosen.
For instance, each delivery option listed in Example 4.1 corresponds
to a particular partition of the given three elements. The mining
approach based on this selective mode is called the "selective
pattern generation based mining approach", or simply the "selective
approach".

We notice that only this selective mode is feasible, while the
conventional full enumeration based generation mode is infeasible.
To see this, we first introduce the following fundamental criterion.
Similar to the well-known "chemical equilibrium equation", we can
have an "input-output equilibrium function" (or simply "equilibrium
condition") for pattern generation, as stated below:


PROPOSITION 4.1 (EQUILIBRIUM CONDITION). The total count $C(e_i)$
of each element $e_i$ embodied in the resulting pattern set must be
equal to (or no more than) the total count $S(e_i)$ of the same
element in the original element set (e.g. a tuple). That is:


$C(e_i) \le S(e_i)$. (4.1)


The "less-than" case of (4.1) applies when some element(s) of the
original element set are random-walk elements that cannot be formed
into or included in any pattern.

Example 4.2: Any of the delivery options of Example 4.1 satisfies
the above equilibrium function. For instance, the last option e)
means:

The original elements: {V1, V4, V7} = The elements of output:
{V7, V1V4}.

Consider the element V7: from the left-hand side of the above,
S(V7) = 1, and from the right-hand side, C(V7) = 1, so that
C(V7) = S(V7) holds.

However, in the conventional full enumeration mode, the above
equilibrium condition does not hold:

{V1, V4, V7} → {V1, V4, V7, V1V4, V1V7, V4V7, V1V4V7}.

Consider the element V7 again: from the left-hand side of the above,
S(V7) = 1, but from the right-hand side, C(V7) = 4 > S(V7),
violating (4.1)! Furthermore, in a real mining application we will
get $C(e) \gg S(e)$, as can be imagined, and the equilibrium condition
will be seriously violated.
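
The elemental check (4.1) is easy to mechanize. The following is a minimal sketch under the assumption that a tuple and its patterns are represented as Python tuples of element names; the function names `element_counts` and `satisfies_equilibrium` are introduced here for illustration only.

```python
from collections import Counter
from itertools import combinations


def element_counts(patterns):
    """C(e): how many times each element appears across the output patterns."""
    counts = Counter()
    for pattern in patterns:
        counts.update(pattern)
    return counts


def satisfies_equilibrium(source_tuple, patterns):
    """True if C(e) <= S(e) for every element e, i.e. condition (4.1)."""
    supply = Counter(source_tuple)      # S(e)
    used = element_counts(patterns)     # C(e)
    return all(used[e] <= supply[e] for e in used)


source = ("V1", "V4", "V7")
option_e = [("V7",), ("V1", "V4")]                       # delivery option e)
full = [c for r in range(1, 4) for c in combinations(source, r)]

print(satisfies_equilibrium(source, option_e))   # selective option
print(satisfies_equilibrium(source, full))       # full enumeration
print(element_counts(full)["V7"])                # C(V7) under full enumeration
```

The selective option passes the check, while the full enumeration output uses each element four times against a supply of one.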

The above violation can be interpreted in another way: originally
only a single V1, V4, and V7 are given, but the 7 patterns produced
by the conventional approach imply that there should have been four
V1s, four V4s, and so on to form or "support" them. It is certainly
a lossy business if a miner receives the original 3 elements but
returns 7 patterns to the user according to the full enumeration
mode, or any other mode violating the equilibrium condition (4.1).
Interestingly, however, conventional mining work all takes this
lossy approach.


COROLLARY 4.1 (EQUILIBRIUM CONDITION-2). For a given dataset or any
number of tuples of it, let the total number of all elements forming
the result patterns be $C_t = \sum_i C(e_i)$, and the total number
of original elements from the tuples be $S_t = \sum_i S(e_i)$; then


$C_t \le S_t$. (4.2)


The semantics of the above corollary are obvious: what is output
cannot be more than what is input. The equilibrium condition implies
both (4.1) and (4.2), where (4.1) gives an elemental view while
(4.2), as the summation of both sides of (4.1), gives an aggregate
view of the condition (as such, (4.1) implies (4.2) but not vice
versa). In particular, for a single tuple $t$, $S_t = b_t$, the
length of that tuple, and $C_t = \sum_i |Z_i^t|$, the sum of the
lengths of the patterns $\{Z_i^t\}$ generated from that tuple ($i$
is a cardinal number). For an entire dataset,

$S_t = \sum_{t=1}^{u} b_t$, and

$C_t = \sum_{Z} (|Z| \cdot F(Z)) = \sum_k \sum_i (k \cdot F(Z_i^k)) = \sum_k (k \cdot H_k)$,

where $Z_i^k$ is any pattern of length $k$ from the entire pattern
set, and $H_k = \sum_i F(Z_i^k)$ is the sub-accumulative frequency
of all the patterns of length $k$ from the entire dataset.

Since the equilibrium condition is very fundamental, it can be very
useful in applications, especially for checking the correctness of a
pattern generation from a single tuple, a group of tuples, or an
entire dataset. One can easily take the running example (Table 1) to
see how the conventional approaches violate this equilibrium
condition.
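
The aggregate view can be illustrated on a small made-up dataset. The sketch below does not use Table 1; the three tuples in `toy` are hypothetical, and the function name `aggregate_counts` is an assumption. It accumulates the number of length-k patterns under full enumeration and compares the element total of the output with the element total of the input.

```python
from collections import Counter
from itertools import combinations


def aggregate_counts(dataset):
    """Return (S_t, C_t, H) for full enumeration over `dataset`.

    S_t: total number of input elements (sum of tuple lengths).
    H[k]: accumulated frequency of all length-k patterns.
    C_t: total number of elements used by the output, sum of k * H[k].
    """
    s_total = sum(len(t) for t in dataset)
    h = Counter()
    for t in dataset:
        for k in range(1, len(t) + 1):
            h[k] += sum(1 for _ in combinations(t, k))
    c_total = sum(k * h_k for k, h_k in h.items())
    return s_total, c_total, h


# Hypothetical toy dataset (not Table 1 of the paper).
toy = [("V1", "V4", "V7"), ("V2", "V3"), ("V1", "V2", "V3", "V8")]
s_t, c_t, h = aggregate_counts(toy)
print("S_t =", s_t, " C_t =", c_t, " H =", dict(h))
```

For this toy input the output consumes far more elements than the input supplies, so the aggregate condition is violated by full enumeration.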

Furthermore, the equilibrium condition is also applicable to the
case in which the plurality (repetitions) and/or divisibility of any
object in a mining task needs to be considered. In this case, we
only need to change the word "count" (meant by $C(e_i)$ and
$S(e_i)$) into "quantity" or "volume", or the like, in an
application.

More importantly, the equilibrium condition can be used to jus- 
tify the proposed "selective pattern generation mode" and Theorem 
4.1, as seen below. 


THEOREM 4.2. The result patterns from a tuple of the classic dataset 
can only be a partition of that tuple. 


Proof. Due to the nature of classic data and based on the
equilibrium condition, no element can belong to more than one result
pattern from the same tuple. As such, the result patterns from a
tuple can only be pairwise exclusive, and no super-pattern and
sub-pattern relation can exist within a tuple. This is the same as
saying that the result patterns from a tuple can only be a partition
of that tuple (refer to the formal definition of the partition given
after the proof of Theorem 4.1). □


The above theorem is equal to saying that only the partition 
based selective pattern generation mode is feasible. 


THEOREM 4.3. For a given data tuple, in the classic case and the 
selective pattern generation mode, Theorem 4.1 and the equilibrium 
condition are mutually necessary and sufficient. 


Proof. To save space, we prove only one part of the above theorem:
in the classic case, Theorem 4.1 holds iff the equilibrium condition
holds under the selective generation mode.

(1) The necessary condition can be proved by contradiction. Assume
that Theorem 4.1 holds but (4.1) does not, i.e., there exists at
least one element $x$ such that $C(x) > S(x)$, which means that more
than one pattern contains $x$, or that a pattern contains more than
one $x$. However, such pattern(s) cannot be produced by the selective
pattern generation mode, by its definition in classic case mining,
so Theorem 4.1 could not hold, which contradicts the assumption.

(2) The sufficient condition can also be proved by contradiction.
Assume that the equilibrium condition holds but Theorem 4.1 does
not, i.e., (ignoring the trivial case) the number of generated
patterns is more than the number of the original elements. Since a
pattern contains at least one element (as noted before), the total
number of elements forming the patterns must be more than the number
of the original elements, so that $C_t > S_t$, which violates the
equilibrium condition (4.2) and contradicts the assumption. □


The above theorem means that, once the equilibrium condition holds,
the pattern set size of Theorem 4.1 is automatically satisfied.

The concept of the selective mode is simple, although how to
implement this mode will be discussed in a future paper. This simple
concept has important effects and significance for proper pattern
mining. The most significant one is the theoretical removal of the
concern about the conjunction issue in the additivity of the pattern
frequentness, as shown below:


THEOREM 4.4 (DISJUNCTION PROPERTY THEOREM). The patterns produced
from the selective generation mode are conjunction issue free.


Proof. For a single tuple, as stated in the proof of Theorem 4.2,
the patterns produced by this mode are pairwise exclusive and thus
conjunction free. This can be verified with any delivery option of
Example 4.1.

For the whole pattern set produced from an entire dataset, there can
be some patterns containing the same element(s), for instance V1V5,
named A for simplicity, and V2V3, named B, from Table 1, but they
can only be produced from different tuples for the above reason.
This means that there is zero instance of their co-occurrence in the
entire dataset, so the probability P(AB) = 0, which has the same
effect as the exclusivity of the two patterns. □


Due to the disjunction property of the patterns produced from 
the selective mode, and based on the probability theory [16] [17], 
the direct additivity of the pattern frequentness and the remedy 
to the probability anomaly stated previously are now fully closed 
with theoretic proofs. 

An important point to note here is that, although within the same
tuple the patterns are pairwise exclusive, so that no super-pattern
and sub-pattern relation exists, as stated in the proofs of Theorems
4.2 and 4.4, such relations do exist in the entire dataset. The only
difference is that those super-patterns and sub-patterns come from
different tuples, as implied in the second part of the above proof.
This is another major point that was not clarified in previous
approaches.

In addition to the above theoretical contributions, the important
practical effect of our solution through the selective mode is the
large reduction of the number of resulting patterns, from the size
of the power set to less than linear (by Theorem 4.1), especially
when the number of elements of a tuple is large. For instance, take
a data tuple of length 40, which is not a very big number compared
with what can be found in many data sources, e.g., in the mentioned
dataset reservoir [47]. By conventional approaches, more than a
trillion patterns will be generated from the tuple and each
pattern's frequency will be incremented by 1, but in the selective
mode, by Theorem 4.1, at most 40 patterns can rationally be
produced. This is a striking difference, and an evident way to
understand why too many (meaningless) patterns are usually produced
by the conventional mining approaches.
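
The figures quoted for a tuple of length 40 can be reproduced directly. This is a plain arithmetic sketch; the variable names are chosen here for illustration.

```python
tuple_length = 40

# Full enumeration: every non-empty subset of the 40 elements is a pattern.
full_enumeration_patterns = 2 ** tuple_length - 1

# Selective mode: a partition of 40 elements has at most 40 blocks.
selective_upper_bound = tuple_length

print(f"full enumeration: {full_enumeration_patterns:,} patterns")
print(f"selective mode:   at most {selective_upper_bound} patterns")
print(f"ratio:            about {full_enumeration_patterns / selective_upper_bound:.2e}")
```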

Another important effect is the change of the frequencies, and thus
of the frequentness of the patterns, from those of the previous
work. The changes mean a reordering of the frequentness of the
patterns. This is obvious, since the number of times a pattern is
produced from the data tuples changes from the full enumeration to
the selective mode. Notice that the relative order of a pattern
against other patterns is more important than the frequentness
number itself, since the frequentness numbers serve for ordering.

As an illustration of what is stated in the above two paragraphs,
suppose that, in an application, the ordered result set from
conventional mining approaches is {A, E, D, B, F, G, C}, while the
ordered result set from the selective mining of the same dataset is
{D, F, A}. That is, both the number of patterns and the frequentness
order of the patterns have changed.

The above means that the mining results from the proposed solution
will be substantially different from those of the conventional
mining approaches. Those approaches differ from each other mainly in
their algorithms, but they all produce the same result pattern set,
other things being equal.

Detailed analyses of the effects of the selective mode together
with the use of the reformulated support $s'_Z$ are given in the
next section.


5 EFFECT ANALYSES OF THE PROPOSED 
SOLUTION 


The reformulated pattern frequentness measure $s'_Z$, combined with
the selective pattern generation mode proposed in the previous
sections, forms the fundamental solution to the issues addressed in
Sections 2 and 3, especially the central overfitting problem. In
summary, the main function and advantage of the selective generation
mode is the reduction of meaningless patterns, but this mode must be
used together with the new measure $s'_Z$; otherwise the overfitting
problem remains if $s_Z$ is used. The main function and advantage of
the new measure $s'_Z$ is the reduction of overfitting through its
remedy of the probability anomaly. However, if only $s_Z$ is
replaced while the full enumeration mode is still in use, the
following problems exist. First, the conjunction issue remains, and
the direct additivity of the pattern frequentness, hence the remedy
of the probability anomaly, is still questionable. Secondly, $s'_Z$
alone cannot fully eliminate the overfitting, and it may cause
underfitting at the same time. Thirdly, the order of the pattern
frequentness does not change. The details are given below, and the
conclusion is that the combined solution must be used, both
theoretically and practically.

To compare $s_Z$ and $s'_Z$, we need the accumulative pattern
frequency $w$. In the rest of this paper, $w$ is used for either the
selective or the full enumeration pattern generation mode if there
is no ambiguity, while $w_0$ is used as the special value of $w$ for
the full enumeration mode. We note that $w_0$ can be obtained
without carrying out the pattern generation, as shown in the next
subsection. We present the derivation of $w_0$ since it may be used
for comparison with a $w$ from a particular selective generation
approach.


5.1 The computation of $w_0$


The accumulative raw frequency $w_0$ of all possible patterns under
the full enumeration mode can be obtained precisely before the
pattern generation as follows:


$w_0 = \sum_{j=1}^{u} \sum_{i=1}^{b_j} C_{b_j}^{i} = \sum_{j=1}^{u} (2^{b_j} - 1) = \sum_{j=1}^{u} 2^{b_j} - u$, (5.1)

where $b_j = |T_j|$, the number of elements in a tuple $T_j$, and
$u = |DB_0|$.

Now, define $g_k$ as the number of tuples each holding $k$ elements
in the original dataset $DB_0$, such that


$\sum_{k=1}^{a} g_k = u$, (5.2)


where $a = \max(b_j)$.

(5.1) can then be further simplified as:


$w_0 = \sum_{j=1}^{u} \sum_{i=1}^{b_j} C_{b_j}^{i} = \sum_{k=1}^{a} g_k \sum_{i=1}^{k} C_k^{i} = \sum_{k=1}^{a} g_k (2^k - 1) = \sum_{k=1}^{a} g_k A_k$, (5.3)

where

$A_k = 2^k - 1$,


which represents the number of all possible patterns, and hence the
sum of their incremented frequencies, enumerated from a tuple of
length $k$.

The simplification of (5.1) to (5.3) reduces the number of exponent
operations from $u$ to $a$; commonly $a \ll u$ in real applications.
Furthermore, the exponent operations can be completely avoided, and
$w_0$ can be calculated with a recursive approach (not presented
here). We note that the exponent operation is not a big issue in
terms of computation cost. However, the computation cost of the
addition operations of (5.1) is more than linear in $u$ and thus
cannot be ignored when $u$ is very large. For instance, if $u$ is in
the trillions or even larger, the computation of (5.1) may take
hours or days on a current desktop system. In contrast, the
computation cost of (5.3) can be seen as near constant and
negligible, since $a$ is relatively very small and usually would not
be over a hundred or a thousand in an application.
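
The equivalence of the per-tuple form (5.1) and the grouped form (5.3) can be checked numerically. The sketch below assumes that only the tuple lengths are needed as input; the function names and the list `lengths` are hypothetical and are not taken from Table 1.

```python
from collections import Counter


def w0_per_tuple(tuple_lengths):
    """Per-tuple form (5.1): sum of (2^b_j - 1) over all tuples."""
    return sum(2 ** b - 1 for b in tuple_lengths)


def w0_grouped(tuple_lengths):
    """Grouped form (5.3): sum of g_k * (2^k - 1) over distinct lengths k."""
    g = Counter(tuple_lengths)    # g_k: number of tuples of length k
    return sum(g_k * (2 ** k - 1) for k, g_k in g.items())


lengths = [3, 2, 4, 3, 3, 2]      # hypothetical tuple lengths b_j
print(w0_per_tuple(lengths), w0_grouped(lengths))
```

The grouped form performs one exponentiation per distinct tuple length, which is where the saving comes from when the number of tuples is large. Note that building the length histogram still requires one pass over the tuples unless it is already available.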


5.2 Overfitting/underfitting quantifications 


In Section 3, we have shown how $s_Z$ defined in (2.1) is reshaped
into $s'_Z$ in (3.2), and how the probability anomaly is primarily
eliminated. Here, we present how the degree of overfitting or
underfitting of the conventional support $s_Z$ can be quantified
against the reformulated one. We first define a primary overfitting
or underfitting ratio $r_s$, depending on whether $r_s > 1$ or
$r_s < 1$, for the quantification in the case that both $s_Z$ and
$s'_Z$ are from the full enumeration pattern generation mode:


$r_s = s_Z / s'_Z$. (5.4)

We then get:


$r_s = s_Z / s'_Z = (S_Z/u) / (S_Z/w_0) = w_0/u = \lambda_0$, (5.5)

where $\lambda_0$ is the average sum of frequencies of patterns
generated per tuple of $DB_0$. That is:


$\lambda_0 = \frac{w_0}{u} = \frac{1}{u} \sum_{j=1}^{u} A_{b_j} = \frac{1}{u} \sum_{k=1}^{a} g_k A_k$. (5.6)

A merit of $\lambda_0$ and $r_s$ is that they give us an immediate
numerical magnitude of the overfitting ratio, since $w_0$ can be
obtained from (5.3). Furthermore, $\lambda_0$ increases with the
overall data tuple length, and so does $r_s$. This means that, in
conventional approaches, the overfitting issue is more severe over
datasets of longer data tuples than over datasets of shorter ones.

Note that, with the full enumeration mode, the overall data tuple
length mentioned above is not exactly the average data tuple length
but the dominant length of the longest data tuples in a dataset.
This is because, in this mode, the number of patterns to be
generated is exponential in the tuple length. We will touch on this
issue a bit further in later subsections.


From the above we see that $r_s > 1$ always holds since $w_0 > u$
($w_0 = u$ would mean no pattern generation), and $r_s$ can be very
large if the overall data tuple length is large. As in the example
indicated before, if the overall tuple length is 40, $r_s$ will be
in the range of trillions. This justifies our previous assertion
that overfitting is inherently embodied in previous mining
approaches, and it strongly disqualifies the extensively used
conventional support $s_Z$.

Note also that the overfitting ratio for the unnecessarily produced
patterns in the conventional approaches is theoretically extremely
high (up to infinity, $\infty$), since their frequencies $S_Z$ are
zeroed in the selective mode. However, this infinite ratio is not
reflected in $r_s$, since $r_s$ is concerned only with the patterns
produced in the selective mode.

On the other hand, $\lambda_0$, hence $r_s$, expressed in (5.6) is
the upper bound of the overfitting ratio for real patterns with the
conventional generation mode, so the degree of overfitting for a
real pattern is lower than that given by (5.6). This is because
$w_0$ includes the frequencies of the patterns unnecessarily
generated by the full enumeration mode. To be clearer, if $s_Z$ is
still used, the related overfitting ratio $r_t$ is given below:


$r_t = s_Z / s'_Z|_{\text{selective}} = (S_Z/u) / (S_Z/w) = w/u = \lambda > 1$, (5.7)


where $r_t > 1$ is due to $w > u$ because of pattern generation,
although $w < w_0$. Notice that, more simply than $\lambda_0$,
$\lambda$ can be approximately taken as the average tuple length of
the entire dataset. This approximated $\lambda$ is indeed the upper
bound of $\lambda$ under the selective pattern generation mode,
based on the "pattern set size theorem" (Theorem 4.1). This is
because the upper bound of $w$ is $w_{up} = \sum_j b_j$, and
$\lambda < \lambda_{up} = w_{up}/u$.

Comparing $\lambda$ in (5.7) with $\lambda_0$, normally
$\lambda \ll \lambda_0$. This means that the overfitting is much
milder than in the conventional approach even if $s_Z$ is used,
which is a direct benefit of the selective mining approach. At the
same time it proves what was stated above: the measured overfitting
is lowered when fewer unnecessary patterns are produced, so that
$w_0$ is reduced to $w$, as is done in the selective mode.
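
The gap between the two overfitting ratios can be made concrete with a few lines of code. This sketch assumes only a list of tuple lengths; `lambda_0` follows (5.6), and `lambda_up` is the average tuple length used above as the upper bound for the selective mode. The function names and both example datasets are hypothetical.

```python
def lambda_0(tuple_lengths):
    """Average accumulated frequency per tuple under full enumeration (5.6)."""
    u = len(tuple_lengths)
    w0 = sum(2 ** b - 1 for b in tuple_lengths)
    return w0 / u


def lambda_up(tuple_lengths):
    """Average tuple length: the upper bound for the selective mode."""
    return sum(tuple_lengths) / len(tuple_lengths)


short = [3, 2, 4, 3, 3, 2]     # hypothetical dataset of short tuples
long = short + [20]            # the same dataset plus one long tuple

for name, data in (("short", short), ("with one long tuple", long)):
    print(f"{name}: lambda_0 = {lambda_0(data):.2f}, "
          f"lambda_up = {lambda_up(data):.2f}")
```

Adding a single long tuple barely moves the average tuple length but raises the full enumeration ratio by several orders of magnitude, which matches the remark that the longest tuples dominate.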

Table 4: Comparisons of the resulting parameters based on the data of Table 1

                                                          First 9 tuples       All 10 tuples
 k    #Z_k   Σs_Z   Σs'_Z   Σs_Z   Σs'_Z   Ex. pattern     S_Z   s_Z   s'_Z    S_Z   s_Z   s'_Z
 A    B      E      Ea      E'     Ea'     T               x     y     z       x'    y'    z'

 1    8      2.56   0.33    2.8    0.24    V4              4     .45   .039    5     .5    .042
 2    18     3.33   0.36    3.6    0.31    V1V2            2     .22   .019    3     .3    .025
 3    21     2.9    0.22    3.0    0.25    V2V4V7          2     .22   .019    2     .2    .017
 4    15     1.8    0.08    1.7    0.14    V1V2V3V8        1     .11   .010    2     .2    .017
 5    6      0.67   0.01    0.6    0.05    V1V2V3V4V7      1     .11   .010    1     .1    .009
 6    1      0.11   0.01    0.1    0.01    V1V2V3V4V7V8    1     .11   .010    1     .1    .009
 Σ    69     11.4   1.00    11.8   1.00


Not only does the overvalued $w_0$ lead to overfitting, it also
induces an undervalued $s'_Z$. That is, $s'_Z$ is an overcorrection
of $s_Z$ if $w_0$ is used, and $s'_Z$ then brings an underfitting
problem to real patterns, which keep the same $S_Z$ in either mode.
However, the degree of the underfitting is smaller than that of the
overfitting with the conventional support. Consider the underfitting
ratio:


$r_u = s'_Z|_{\text{full-enum.}} / s'_Z|_{\text{selective}} = (S_Z/w_0) / (S_Z/w) = w/w_0 < 1$. (5.8)


The inverse of $r_u$ should be used to express the degree of
underfitting, and the degree is thus defined to be:


$r'_u = 1/r_u = w_0/w > 1$.


The following compares the degrees of underfitting and overfitting:

$r'_u / r_s = (w_0/w) / (w_0/u) = u/w < 1$.

Since $u < w$ (otherwise there is no pattern generation), the above
proves that the degree of underfitting is lower than that of
overfitting for real patterns.

Note that for patterns to which the selective mode assigns smaller
frequencies than the $S_Z$s from the full enumeration mode,
$r_u < 1$ does not generally hold, as can be inferred from (5.8).
Accordingly, those patterns with $r_u > 1$ would still be overfitted
in the full enumeration mode even if $s'_Z$ is used.

Now, since $w_0 > w > u$, the above implies the following relations:


$r_s > r_t > r_u$. (5.9)


That is, only by using the selective pattern generation mode can
both the overfitting and the underfitting be minimized.
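
The ratios of this subsection depend only on the three totals u, w, and w_0, so they can be tabulated together. The following is a minimal sketch; the function name `fitting_ratios`, the dictionary keys, and the numeric inputs are assumptions made for illustration and are not values from the paper.

```python
def fitting_ratios(u, w, w0):
    """Ratios of Section 5.2 computed from the dataset size u, the selective
    accumulated frequency w and the full-enumeration accumulated frequency w0."""
    if not (w0 >= w >= u > 0):
        raise ValueError("expected w0 >= w >= u > 0")
    return {
        "r_s": w0 / u,        # overfitting of s_Z, full enumeration (5.5)
        "r_t": w / u,         # overfitting of s_Z, selective mode (5.7)
        "r_u": w / w0,        # underfitting ratio of s'_Z (5.8)
        "r_u_inv": w0 / w,    # degree of underfitting, 1 / r_u
    }


ratios = fitting_ratios(u=6, w=12, w0=42)   # hypothetical totals
for name, value in ratios.items():
    print(f"{name} = {value:.3f}")
print("underfitting / overfitting =", ratios["r_u_inv"] / ratios["r_s"])
```

With these inputs the degree of underfitting divided by the degree of overfitting equals u/w, as derived above.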

In summary, for the full enumeration based approaches, if $s_Z$
(2.1) is used, overfitting persists without underfitting, and the
longer the overall data tuple length, the more severe the
overfitting. When $s'_Z$ (3.2) is used, real patterns get
underfitted, and the longer the overall data tuple length, the more
serious the underfitting, although its degree is much milder than
that of the overfitting caused by the use of $s_Z$. Lastly, in the
full enumeration mode, even if $s'_Z$ is used and underfitting
happens, overfitting is still unavoidable, because at least the
large number of unnecessarily generated patterns, as stated before,
are overfitted by nature. In general, to overcome both the
underfitting and the overfitting problems, the selective generation
mode should be used.


5.3 Numerical comparisons between $s_Z$ and $s'_Z$


For a more intuitive understanding of the difference between
evaluating the pattern frequentness with the conventional $s_Z$ and
with the reformulated $s'_Z$, we present the related comparisons in
Table 4, based on the data given in Table 1, with both $s_Z$ and
$s'_Z$ from the full enumeration mode. This is all we can do at
present, since the implementation of the selective mode, and thus
the $s'_Z$ from this mode, can only be presented in future work.
Even so, the comparisons presented here are still informative.

In Table 4, column B is the subtotal number of patterns of the same
length; column E shows the sum of the $s_Z$s of patterns of the same
length, as well as the overfitting ratio (the last row) against the
new $s'_Z$, based on the first 9 tuples of Table 1. The semantics of
column E' are the same as those of column E, but based on all 10
tuples of Table 1. Columns Ea and Ea' present the sums of the
$s'_Z$s of patterns of the same length for the 9-tuple and 10-tuple
cases respectively, where the probability anomaly is eliminated. The
other columns, from column T to the last, show some example patterns
and their raw frequencies (x) and frequentness in terms of $s_Z$ and
$s'_Z$ (columns y and z respectively), where columns x, y, and z are
the results from the first 9 tuples, while columns x', y', and z'
are the results after the last tuple is added into Table 1.

As addressed before, overfitting is dominant in conventional
approaches and exhibits two typical symptoms: too many frequent
patterns and an unstable mining result set. $s'_Z$ greatly remedies
the first symptom. For this small example and under the conventional
regime, the number of frequent patterns with $s_{min} = 20\%$ is 23
from the 9 tuples and 32 from the 10 tuples, a 40% increase, which
illustrates that the result set is very unstable. The main cause is
the probability anomaly, which results in an overfitting ratio
$r_s > 11$ in either the 9-tuple or the 10-tuple case (the grand sum
of $s_Z$ in column E or E' gives this ratio). More strikingly, at
$s_{min} = 10\%$, all of the 69 patterns are frequent from either 9
or 10 tuples in the conventional case!

In contrast, with $s'_Z$, because it removes the probability
anomaly, there is no frequent pattern even at $s_{min} = 10\%$ from
the entire dataset. This result complies with the intuition that we
cannot mine a big number of frequent patterns from such a small
element set and small database at a fairly high threshold $s_{min}$
(e.g., 10% or higher). Although, as noticed in the previous
subsection, we need to consider the overcorrection, i.e., the
underfitting effect of $s'_Z$ in the full enumeration case, the
effect is not very serious here, since the overall data tuple length
in this example is not large compared with that of real application
datasets.

$s'_Z$ also (partially) remedies the second symptom, the unstable
mining result set. This is because the decrease from $s_Z$ to $s'_Z$
reduces the number of patterns that pass the given threshold
$s_{min}$ and enter the result set when the data size is increased.
The more complete remedy for the instability is to use $s'_Z$
together with the selective pattern generation mode, since this mode
reduces the number of patterns to be generated. Furthermore, with
this mode, the reduction rate from $s_Z$ to $s'_Z$ is neither linear
nor the same for different $Z$s.

Other features can be further inferred from the table: when using
$s'_Z$, the more frequent a pattern is, the more stable its
frequentness is as the data size changes, as shown in columns z and
z' of Table 4. This is what is normally expected: as the data size
increases, the frequentness of every pattern asymptotically
approaches its natural degree. In addition, $s_Z$ in general
increases faster than $s'_Z$. The theoretical reasons for these
observations are given in the next subsection.


5.4 Theoretical comparisons between $s_Z$ and $s'_Z$


The above observations can be formalized as a theorem as follows: 


THEOREM 5.1. $s'_Z$ outperforms $s_Z$ in the following aspects:

(1) $s'_Z$ increases slower than $s_Z$ if $\Delta w > \lambda$,
where $\Delta w$ is the added accumulated frequency produced from
the added data tuples, and $\lambda$ (= $w/u$) is the average
accumulated pattern frequency per tuple.

(2) As long as the added data tuple contains $Z$, $s_Z$ always
increases, while $s'_Z$ may not, or may even decrease.

(3) A larger $s'_Z$ increases slower than a smaller $s'_Z$.


In the following proof, $w$ and $\lambda$ are used generally for
either the full or the selective pattern generation mode, unless
$w_0$ and $\lambda_0$ need to be specified.


Proof. Take the initial $u$, $w$, $s_Z$ and $s'_Z$ for a given
pattern $Z$ and its raw frequency $F_Z$. Now suppose one data tuple
is added into the dataset, i.e., $\Delta u = 1$. It can then cause
$F_Z$ to increase by at most 1, i.e., $\Delta F_Z = 1$, since one
data tuple can generate a particular pattern only once, but it can
cause $w$ to increase by more than 1, i.e., $\Delta w > 1$, since in
general more than one pattern is produced from a tuple; otherwise
the problem becomes trivial. Then:


$\Delta s_Z / s_Z = \Delta(F_Z/u) / (F_Z/u)$
$= ((u \Delta F_Z - F_Z \Delta u)/u^2) / (F_Z/u)$
$= \Delta F_Z/F_Z - \Delta u/u = 1/F_Z - 1/u$, (5.10)

and

$\Delta s'_Z / s'_Z = \Delta(F_Z/w) / (F_Z/w)$
$= ((w \Delta F_Z - F_Z \Delta w)/w^2) / (F_Z/w)$
$= \Delta F_Z/F_Z - \Delta w/w = 1/F_Z - \Delta w/w$. (5.11)

Since $w = \lambda u$, where $\lambda$ is the average accumulated
pattern frequency per tuple (refer to (5.7) and (5.6)), the above
can be reformulated as:


$\Delta s'_Z / s'_Z = 1/F_Z - \Delta w/(\lambda u)$
$= 1/F_Z - (1/u)(\Delta w/\lambda)$. (5.12)

(5.10) and (5.12) state that:

(1) $\Delta s'_Z/s'_Z < \Delta s_Z/s_Z$ as long as
$\Delta w > \lambda$, which proves the first conclusion of the
theorem. At the same time, it importantly implies that, to keep
$\Delta s_Z/s_Z$ comparable with $\Delta s'_Z/s'_Z$ when the data
size increases, that is, to slow down the rapid increase of $s_Z$,
the data tuple length would ultimately have to decline toward 1.

Note that, in the full enumeration mode, $\Delta w$ can be much
larger than $\lambda$ (or unambiguously, $\Delta w_0 \gg \lambda_0$)
when a relatively long tuple is added. Furthermore, since a bunch of
patterns, including existent patterns, can be generated from the
added tuple, the added tuple may cause the frequencies of a number
of patterns to increase and ultimately make them frequent. Here the
existent patterns are those generated before the addition of the new
tuple. This explains how the instability of the mining result set
and the overfitting problem take place at the same time in the full
enumeration case. In the running example, $\lambda_0$ is about 12
for the first 9 tuples. When the 10th tuple (of 4 elements) is
added, $\Delta w_0 = 15 > \lambda_0$, and we have seen before how it
causes a sudden increase in the number of frequent patterns and thus
an unstable mining result set; now we have its theoretical reason.

(2) (5.10) tells us that $\Delta s_Z/s_Z > 0$ always holds, since
$F_Z < u$ (if $F_Z = u$, $Z$ can be fully removed from the dataset,
since every data tuple holds $Z$). However, (5.11) and (5.12)
indicate that, even if an added tuple increases $F_Z$ (by 1),
$\Delta s'_Z/s'_Z$ can be either positive or negative, depending on
whether the following condition, derived from (5.11), holds:


$\Delta w \cdot F_Z < w$. (5.13)


If it holds, $s'_Z$ increases; otherwise it decreases (ignoring the
equality case). This proves the second conclusion of the theorem. It
implies that, similar but opposite to the overfitting issue of the
above paragraph, under the $s'_Z$ regime and the full enumeration
mode, a fairly long added tuple, and thus a large $\Delta w$, may
cause the frequentness of a number of existent patterns to decrease,
so that underfitting happens; but the degree of the underfitting is
smaller than that of the above-mentioned overfitting. We have proved
this in subsection 5.2; it can also be proved with the above
formulas, which we omit here.

On the other hand, we can see that the above inequality (5.13) holds
more easily with a smaller $F_Z$ than with a larger one. This marks
another difference from the case of point (1) above: an added tuple
can only lead $s_Z$ to increase, but here it may cause the $s'_Z$ of
some (less frequent) patterns to increase while that of others
decreases. This is indeed a more proper reflection of the effect of
data size changes, since such a change should alter the comparisons
of some patterns' frequentness. These two aspects together explain
why the result set from the reformulated support $s'_Z$ is more
stable than that from the conventional $s_Z$.

The above two points not only address the cause of overfitting 
and underfitting, but also reveal a data homogeneity issue. If the 
lengths of different tuples of the same dataset vary too much, this 
may affect the proper measure of the mined patterns' frequentness 
and hence the reliability of the mining. This is similar to the 
homogeneity problem in numerical statistical modeling: if the 
magnitudes of the data (numbers) differ too much in a sample, it is 
difficult to derive a reliable model from the data. 

(3). Since a smaller $F_Z$ makes (5.13) easier to satisfy, and a 
smaller $F_Z$ means a smaller $s'_Z$, (5.12) and (5.13) imply that a 
smaller $s'_Z$ increases faster than larger ones. This proves the 
third conclusion of the theorem as well. To be clearer, from (5.12), 
$\Delta s'_Z / s'_Z = 1/F_Z - \Delta w / w$; for a given 
$\Delta w / w$, a smaller $F_Z$, and thus a smaller $s'_Z$, gives a 
larger value of the left-hand side $\Delta s'_Z / s'_Z$. 
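
The sign condition (5.13) and the third conclusion can be checked numerically. The following is a minimal sketch, assuming (this form is not restated in this section) that the reformulated support is $s'_Z = F_Z / w$, and that the added tuple contains Z, so that $F_Z$ grows by 1 while $w$ grows by $\Delta w$. The function names and the numbers are illustrative only.

```python
from fractions import Fraction

def relative_change(f_z, w, delta_w):
    """Exact relative change of s'_Z = F_Z / w (assumed form) after one
    tuple containing Z is added: F_Z -> F_Z + 1, w -> w + delta_w."""
    before = Fraction(f_z, w)
    after = Fraction(f_z + 1, w + delta_w)
    return (after - before) / before

def satisfies_5_13(f_z, w, delta_w):
    """Condition (5.13): delta_w * F_Z < w."""
    return delta_w * f_z < w

# Illustrative values: w = 100 patterns so far, the new tuple adds 15.
w, delta_w = 100, 15
for f_z in (2, 5, 10):
    print(f_z, satisfies_5_13(f_z, w, delta_w),
          float(relative_change(f_z, w, delta_w)))
```

Under this assumed form, the exact change has the same sign as $w - \Delta w \cdot F_Z$, so the first-order expression in (5.12) and the exact computation agree on when $s'_Z$ rises or falls.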

Although we added only one data tuple in the proof above, the 
proof can easily be generalized to the addition of multiple 
tuples. □ 


The above proof not only justifies the outperformance of $s'_Z$ 
over $s_Z$, but more importantly reveals a number of interesting and 
intrinsic properties underlying the pattern frequentness measure. 
For instance, the tuple length variation issue is seldom addressed 
by previous work. Furthermore, the use of a selective generation 
mode will reduce the effect of $\Delta w$ and hence the effect of 
the tuple length variations. 


5.5 The significance and impacts of the 
proposed solution 


The first significance is that we have found a solution for the 
problems identified in Sections 2 and 3. The findings of our 
problem investigation and analysis indicate that, although pattern 
mining, or “knowledge discovery from database (KDD)” in general, is 
fact based (what a database holds are facts or experimental results), 
without a proper theory and mining strategy we would obtain not 
reliable but misleading mining results. This paper and our solution 
are an attempt to clarify and correct some fundamental concepts so 
as to ultimately improve the mining reliability. 

Another significance is the effectiveness of our solution. A 
well-established measure should feature at least rationality and 
simplicity. Simplicity implies ease of understanding and of use. 
The proposed s’(Z) keeps the simplicity while rationalizing 
the previous s(Z) and remedying the probability anomaly. The use of 
s’(Z) together with the selective mode can effectively get around 
the overfitting/underfitting and other issues addressed in Sections 2 
and 3. In particular, as indicated in the literature survey presented 
in Section 2, reducing the size of the result pattern set is a major 
research interest in previous work, and attempts have been made, 
e.g., [33], [48], but the reduction is no more than a few orders of 
magnitude, while with our new enumeration mode the number of 
patterns is reduced from the size of the power set to less than 
linear in the tuple size. 

Furthermore, the equilibrium condition, the clearance of the 
probability anomaly and the use of the selective pattern generation 
together form a set of three rationality check criteria. Since these 
criteria are derived from the simplest classic case mining, without 
involving any exogenous measure or constraint, every mining 
approach should comply with them. In comparison, readers will find 
that no previous approach did or could claim to satisfy these 
criteria. 

Other consequences of the solution and criteria include: 

1). Because of the equalization of the pattern frequentness and 
the probability measure of events, we can use 3% or 5% as the 
frequentness threshold, as used in various research and applications 
though not formally required [45], so that there is no need to 
bother the user (unless s/he wishes) to define an smin. 

2). Under the $s'_Z$ regime, and because $\sum s'_Z = 1$, in the case 
of a large number of patterns their frequentness distribution is 
more analogous to the (continuous) probability density distribution 
than to the (discrete) probability mass distribution in probability 
theory. When a 3% threshold is used for infrequent patterns under 
the $s'_Z$ regime, it refers to all those patterns whose cumulated 
frequentness is less than 3%. Similarly, the top 10% frequent 
patterns are those whose accumulated frequentness is no less than 
10%. 
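
One possible reading of the cumulative 3% threshold in point 2) can be sketched as follows. This is only an illustration under stated assumptions: patterns are sorted from least to most frequent, their frequentness values sum to 1, and the infrequent set is the longest prefix whose accumulated frequentness stays below the threshold. The function name and the toy values are hypothetical.

```python
def infrequent_patterns(frequentness, threshold=0.03):
    """Return the least frequent patterns whose accumulated frequentness
    stays below the threshold (assumed reading of the cumulative rule)."""
    accumulated = 0.0
    selected = []
    # Walk the patterns from least to most frequent.
    for pattern, s in sorted(frequentness.items(), key=lambda kv: kv[1]):
        if accumulated + s >= threshold:
            break
        accumulated += s
        selected.append(pattern)
    return selected

# Toy frequentness values (made up for illustration, summing to 1).
toy = {"a": 0.01, "b": 0.015, "c": 0.3, "d": 0.675}
print(infrequent_patterns(toy))
```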

3). The above impacts and consequences will unavoidably prop- 
agate to other mining applications based on pattern mining, e.g., 
association rules mining, causation mining, etc. 

4). In practice, the more reliable mining approach and the greater 
reduction of the mining result set induced from the proposed so- 
lution would substantially simplify and facilitate business policy 
selection and decision making based on pattern mining. 

5). The concepts and solution presented in this paper could also 
lay a foundation for more complicated mining applications, e.g., 
when plurality and/or divisibility issues need to be considered in 
a mining task, which we call “plurality case” mining. In this case, 
an element represents a type of “objects”, similar to chemistry, 
wherein an element represents each and all “atoms” of the 
same type. For instance, the element “O” represents each and all 
oxygen atoms, “C” all carbon atoms, and so on. In the 
plurality case, complex patterns as mentioned in Section 4 will 
generally be produced, and we can follow the molecular formula 
approach to express such patterns. 

For simplicity, the divisibility can be included in the “plurality 
case" mining as well, with an assumption that we could find a way 
to divide the concerned divisible element into its possible smallest 
parts each being treated as an atom upon a pattern generation, 
then the divisible elements can be taken as of the same plurality 
property as others. As such, the itemset mining could be more 
suitably studied in the plurality case mining. 

With the above ideation, what has been established for classic 
case mining can easily be extended to plurality case mining. For 
instance, we have seen that the equilibrium condition established 
in Section 4 can be extended to the plurality case. Here we add that 
the selective pattern generation mode and the pattern disjunction 
property theorem (Theorem 4.4) can be extended to plurality 
case mining as well: 


COROLLARY 5.1. The partition based selective pattern generation 
mode is also the only feasible mode for the plurality case pattern 
mining, with the only notice that this mode is applied to the object 
(atom) level. 


The rationale of the above corollary is that, in the plurality case 
the smallest constituent unit of the pattern is the object (atom), not 
the type, and we thus need to look at the pattern composition in 
terms of atoms. In the atom level, each atom is unique and cannot 
belong to more than one pattern at a time. Then pattern generation 
from those atoms stored in a tuple can only be equal to a partition 
of those atoms, thus the selective mode and only this mode is 
applicable in the plurality case mining. 


COROLLARY 5.2. The pattern disjunction property is applicable to 
the plurality case pattern mining as well, since the patterns can only 
be produced from the selective generation mode. 

The above corollary is easy to understand, since its proof 
is the same as that of Theorem 4.4. As such, the direct additivity 
of the pattern frequentness is applicable to plurality case 
mining. As an example of the application, the corollary answers 
why, in industrial mining practice, the percentages of the contents 
CuO, FeO, etc., in a mine can be directly added up without 
worrying about the conjunction issue, although the substances 
contain the same “O”. CuO and FeO are disjoint because 
the oxygen atom involved in CuO is not the same oxygen atom 
involved in FeO when we look at these atoms individually, although 
both are labeled “O”. 
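
The atom-level partition argument behind Corollaries 5.1 and 5.2 can be illustrated with a small check. This is a sketch under the assumption that a tuple is a multiset of atoms labeled by type and that a valid pattern set must use every atom exactly once; the function name is hypothetical.

```python
from collections import Counter

def is_partition(tuple_atoms, patterns):
    """True if the patterns together use each atom of the tuple exactly
    once, i.e. they form a partition of the tuple at the atom level."""
    used = Counter()
    for pattern in patterns:
        used.update(pattern)
    return used == Counter(tuple_atoms)

# A tuple holding one Cu atom, one Fe atom and two distinct O atoms.
atoms = ["Cu", "O", "Fe", "O"]
print(is_partition(atoms, [["Cu", "O"], ["Fe", "O"]]))          # True
print(is_partition(atoms, [["Cu", "O"], ["Fe", "O"], ["O"]]))   # False
```

In the first call CuO and FeO each consume a different oxygen atom, so the two patterns are disjoint and their frequentness can be added directly; the second call fails because it would need a third oxygen atom.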

As presented above, the concepts of the pattern expressions, the 
applicability of the equilibrium condition, the selective pattern 
generation mode, and the pattern disjunction property together 
form the basic theory of plurality case pattern mining. 
At the same time, these concepts are conformable across disciplines 
with other established theories, namely probability, statistics and 
chemistry, and the list can be extended. This promises the 
application of pattern mining to computational chemistry [51]. 

6). We note that this paper is only an initial part of the pursuit 
of a fundamental pattern mining theory, which is a large task that 
cannot be completed in one paper. 
For instance, more formal definitions and measures of overfitting 
and underfitting may be desired; a reliability theory and measures 
for mining methods and mining results are definitely needed; and 
the implementation of the selective generation mode is another 
major task. These are what we strive for in our future work. 


6 CONCLUSIONS 


Thousands of research papers related to pattern mining have been 
published so far, yet the goal of reliable real world pattern mining 
is still far to reach, since the simplest classic case mining has not 
been well pursued as revealed in this paper. The basic reason for 
this is the lack of well-established mining reliability theory and 
criteria for different mining approaches to comply with. This paper 
reexamines the two reliability determinants, the support s(Z) and 
the full enumeration pattern generation mode generally used by 
previous approaches. 

Traditionally s(Z) is the only generally used criterion to evaluate 
a pattern, yet this measure is ill defined and causes serious proba- 
bility anomaly. The full enumeration mode produces excessive and 
unrealizable patterns. Based on the investigation and analysis of 
the theoretical fallacies of previous approaches, a theoretic solution 
has been derived and proposed in this paper. This includes the refor- 
mulated s'(Z) and the three fundamental rationality maintenance 
criteria that every mining approach should observe: 

(1) the equilibrium condition, 

(2) probability anomaly free, and 

(3) the use of the feasible selective pattern generation mode. 

Notice that no previous approach did or could claim the satisfac- 
tion of the above criteria. A natural conclusion is then, no previously 
proposed approach or algorithm, however efficient, could achieve 
a reliable pattern mining. 

The s'(Z) and the three new criteria will certainly 
improve the rationality of the mining theory and the reliability of 
the mining results. A direct outcome of the improvement is the 
great reduction of the number of result patterns (from the power 
set to less than linear) in classic case mining without using 
any exogenous measure or constraint. The theory and solution 
proposed in this paper thus feature simplicity, rationality 
and effectiveness. These features are undoubtedly important for the 
rising big data science. These merits together imply a revolutionary 
change towards more effective and more reliable pattern mining. 

This paper, however, is only an initial work for peers to discuss 
and ultimately pursue a full pattern mining theory. Further 
work, such as the implementation of the selective pattern generation 
mode, the mining reliability theory, and so on, are major tasks that 
await us. 


ABSTRACT 


A vital element of a cyberspace infrastructure is cybersecurity. 
Many protocols have been proposed for security issues, which leads 
to anomalies that affect the related infrastructure of cyberspace. 
Machine learning (ML) methods are used to mitigate anomalous 
behavior in mobile devices. This paper aims to apply a High- 
Performance Extreme Learning Machine (HP-ELM) to detect 
possible anomalies in two malware datasets. Two widely used 
datasets (the CTU-13 and Malware) are used to test the 
effectiveness of HP-ELM. Extensive comparisons are carried out 
in order to validate the effectiveness of the HP-ELM learning 
method. The experimental results demonstrate that HP-ELM achieved 
its highest accuracy, 0.9592, for the top 3 features 
with one activation function. 


1 Introduction 


The popularity of smartphones has been growing recently. The total 
number of smartphone users is expected to reach six 
billion by 2020 (according to Ericsson [1]), there have been more 
than 40 million attacks by malicious mobile malware (according to 
Kaspersky Labs [2]), and the cost of a data breach is expected to 
exceed $150 million by 2020 (according to Juniper Research [3]). 
Smartphones are affected by malware which installs its modules in 
the system directory and attempts to subvert the system's typical 
behavior [4]. 


To cope with these problems, researchers and security analysts 
have conducted studies, for practical or scientific purposes, to 
establish reliable applications for mobile devices. In [5], Russon 
reported various types of hidden malware in more than 104 Google 
Play applications, which were downloaded over 3.2 million times. 
This malware causes numerous attacks on users' mobile devices and 
affects the CPU load of the system [5]. 


Several security protocols have been applied for malware detection. 
Such systems can be either anomaly-based or behavior-based. The 
former relies on a predefined pattern. This type of Intrusion 
Detection System (IDS) is an efficient approach for identifying 
sweeps and probes of network hardware, and it gives early warnings 
of potential intrusions before attacks such as Telnet are launched. 
Systems of this type depend on receiving regular signature updates 
(i.e., on the extent of the signature database). The dramatic impact 
of DDoS attacks by the Mirai botnet and its variants highlights the 
risks for IoT devices [6]. An end-to-end security scheme called the 
Datagram Transport Layer Security (DTLS) protocol utilizes an 
encryption technique [7]. Although DTLS mitigates specific types of 
attacks, it fails to identify all types of big-data-based anomalous 
behaviors. The significant drawback is that such schemes are 
attack-specific [8]. Therefore, it is essential to develop 
methodologies and procedures to measure the uncertainty of IoT 
devices and their potential capacity to make smart decisions, in 
order to increase the efficiency of security and privacy in IoT 
environments. It is also essential to provide smart recommendations 
in terms of fraud or vulnerability detection using learning 
algorithms. 


Soft computing techniques such as neuro-fuzzy have been proposed 
to mitigate the security issue in IoT-based malware detection [9]. 
Neuro-fuzzy classifiers are trained from static data and from data 
that applications generate during their execution. Neuro-fuzzy has 
potential in malware detection based on collected statistics and 
derived fuzzy rules ([10], [11], [12]). Moreover, several learning 
algorithms have been proposed to identify malware code and its 
behavior and to identify specific types of threats ([13], [14]). 
The main drawback of soft computing techniques such as neuro-fuzzy 
is that the fuzzy rules are randomly generated to tune the 
weights of the neural network, and the number of hidden layers can 
increase depending on the type of application. However, there is no 
way to speed up the tuning procedure. ELM, by contrast, can adjust 
the activation functions in Single Hidden Layer Feed-forward 
Neural Networks (SLFNs) ([15], [16], [17], [18]). 


In this paper, we utilize a fast convergent solution named High 
Performance Extreme Learning Machine (HP-ELM), in whose learning 
phase the parameters of the hidden neurons are created randomly, 
independently of the training data [19]. Furthermore, the method 
can be tested with data sets of various scales, different structure 
selection options, and regularization methods. In our study, HP-ELM 
is applied to classify the malware in two datasets. 


In our study, we deal with the following research questions: 1. 
“What are the current data analytic techniques that are being used 
to extract meaningful IoT based malware devices values?”, 2. 
“How to maintain malware influence on IoT?” and 3. “What is the 
effect of the proposed security and privacy preserving framework 
in terms of scalability in the big data platform?”. Our work 
contributes to the forensic analysis of malware behaviour. Previous 
results and datasets of mobile malware applications are used in 
this study ([20], [21]). 


The contribution of this paper is an IDS method to recognize IoT 
malware, as follows. We applied a feature selection method to two 
malware datasets. We developed a malware detection system for the 
IoT environment based on the HP-ELM classifier. We tested the 
efficiency of the proposed system using two benchmark datasets, the 
CTU-13 dataset [20] and the DyHAP malware dataset [21], and we 
compared the performance of HP-ELM with and without feature 
selection. 


The manuscript is organized as follows. Previous work is 
discussed in Section 2. Section 3 presents the HP-ELM detection 
method and Section 4 discusses data preparation. Section 5 presents 
scenarios and the setting of the HP-ELM parameters. Section 6 
presents the simulation setup and evaluation metrics. Experiments 
are presented in Section 7. Section 8 concludes the paper 
and presents future directions. 


2 Previous work 


This section presents technical papers which use a security 
framework for malware-based IoT [22]. In [22], an application 
program is scrutinized without executing the actual application 
(i.e., a reverse engineering process), while in [23] a program 
investigates the behavior of the running processes by executing the 
application. On the one hand, static malware analysis requires low 
memory resources and minimal CPU processing, and the analysis 
process is fast; on the other hand, dynamic analysis can be used to 
detect unknown changes and malware existence [23]. A machine 
learning based malware detection method, which used a private and a 
public dataset, was proposed in [24]. The authors validated their 
solutions in various cases. In our paper, we also evaluate the 
proposed methods on a private (Android Malware) and a public 
(CTU-13) dataset using different HP-ELM parameter settings. 


The authors in [23] use a crowdsourcing system in order to obtain 
the flows of the application's behavior. Many studies have been 
carried out on mobile phone malware based on a single operating 
system or as a comparative study between two operating systems. The 
authors in [22] present a multilevel, behavior-based Android malware 
detection method using 125 existing malware families and report 96% 
detection of malware. However, this approach relies on system calls, 
which contain less semantic information, and is not able to detect 
malicious behavior accurately. Recently, the authors in [25] 
captured system calls and "binder transactions" to build a runtime 
behavior signature. This method presents a new client-server system 
for malware variant detection which jointly covers the logic 
structures and the runtime behaviors of mobile applications for 
Android devices. However, this approach depends on the network 
status and on graph mining; it impacts network performance, which 
directly influences the computational complexity. Therefore, it is 
not suitable for real-time detection. 


In [26], the authors present a lightweight detection system to 
identify malicious behaviors on mobile devices. Statistical Markov 
chain models are applied to build the application behavior in the 
form of a feature vector, and a random forest is adopted to classify 
the application behavior. The results indicate an accuracy of 96%. 


Mirai is associated with remarkable DDoS attacks affecting 
high-profile targets [6]. It is a wake-up call for access control 
in IoT devices and for analyzing the risk of increasing DDoS 
attacks. It compromises the administrative credentials of IoT 
devices using brute force, relying on a small dictionary of possible 
username and password pairs [27]. To cope with such issues, we need 
resource-efficient hybrid IDSs as novel security solutions that can 
efficiently protect devices against DDoS, considering the 
limited resources available in the IoT environment. 


3 HP-ELM detection method 


The main drawback of soft computing techniques is that the fuzzy 
rules are randomly generated to tune the weights of the neural 
network, and the number of hidden layers can increase depending 
on the type of application. However, there is no way to speed up 
the tuning procedure. ELM, by contrast, is able to adjust the 
activation functions in single hidden layer feed-forward neural 
networks (SLFNs) ([15], [16], [17], [18]). 


This section presents the HP-ELM mobile network architecture 
applied to two malware datasets, Android Malware [21] and 
CTU-13 [20]. As shown in Figure 1, ELM is a fast training 
method for SLFN networks [28]. 


[Figure 1 diagram: Input Layer, Hidden Layer, Output Layer] 


Figure 1. Computing the output of an SLFN (ELM) model 


Figure 1 shows how the output of an ELM model is computed. The 
model includes three layers of neurons. There is no computation in 
layer one (input). The input layer weights w and biases b are set 
randomly and never adjusted (random distribution of the weights). 
The output layer is linear, with no transformation function 
and no bias. Therefore, the computing time is very 
short. The word “Single” in “SLFN” refers to the fact that there is 
only one layer of non-linear neurons (the hidden layer). The main 
advantage of ELM is that it produces weakly connected hidden layer 
features, because the input layer weights are randomly generated, 
which improves the generalization properties of the solution of the 
linear output layer [19]. 


The ELM method is described as follows. We consider a set of N 
distinct training samples $(x_i, t_i)$, $i \in [1, N]$, with 
$x_i \in \mathbb{R}^n$ and $t_i \in \mathbb{R}^m$. An SLFN with L 
hidden neurons has the following output equations: 


$$\sum_{j=1}^{L} \beta_j \phi(w_j x_i + b_j), \quad i \in [1, N], \qquad (1)$$


where $\phi$ is the activation function (a sigmoid function is a 
common choice, but other activation functions are possible, 
including linear [17], [18] and [28]), 
$w_j = [w_{j1}, \ldots, w_{jn}]$ is the weight vector 
that connects the n input nodes to the jth hidden node, $b_j$ is the 
bias value of the jth hidden node, and 
$\beta_j = [\beta_{j1}, \ldots, \beta_{jm}]^T$ is the set of 
output weights that connects the jth hidden node with the m output 
nodes. The relation between the inputs $x_i$ of the network, the 
target outputs $t_i$ and the estimated outputs $y_i$ is: 


$$y_i = \sum_{j=1}^{L} \beta_j \phi(w_j x_i + b_j) = t_i + \epsilon_i, \quad i \in [1, N], \qquad (2)$$


where $\phi$, $w_j$ and $b_j$ are as defined above. The noise 
$\epsilon$ includes both random noise and dependency on variables 
not present in the inputs X. 


For N samples, the N equations can be represented as 
$H\beta = T$, where 


$$H(w_1, \ldots, w_L, b_1, \ldots, b_L, x_1, \ldots, x_N) =
\begin{bmatrix}
\phi(w_1 \cdot x_1 + b_1) & \cdots & \phi(w_L \cdot x_1 + b_L) \\
\vdots & \ddots & \vdots \\
\phi(w_1 \cdot x_N + b_1) & \cdots & \phi(w_L \cdot x_N + b_L)
\end{bmatrix}_{N \times L}, \qquad (3)$$

$$\beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_L^T \end{bmatrix}_{L \times m}, \qquad (4)$$

$$T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m}. \qquad (5)$$


The output weight matrix $\beta$ is found by solving the least 
squares problem: 


$$\hat{\beta} = \arg\min_{\beta} \| H\beta - T \| = H^{\dagger} T, \qquad (6)$$

$$\hat{\beta} = (H^T H)^{-1} H^T T, \qquad (7)$$


where $H^{\dagger}$ is the Moore-Penrose (MP) pseudo-inverse of the 
hidden layer output matrix H. 
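

The training procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration of a basic ELM, not the HP-ELM toolbox itself: it assumes a sigmoid activation, standard-normal random weights and biases, and a toy dataset made up for the example; the function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_elm(X, T, n_hidden, seed=0):
    """Basic ELM: random hidden layer, output weights by pseudo-inverse."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # random, never tuned
    b = rng.standard_normal(n_hidden)                # random, never tuned
    H = sigmoid(X @ W + b)                           # N x L hidden output matrix
    beta = np.linalg.pinv(H) @ T                     # least-squares output weights
    return W, b, beta

def predict_elm(X, W, b, beta):
    """Linear output layer applied to the hidden layer features."""
    return sigmoid(X @ W + b) @ beta

# Toy data (made up): 200 samples, 4 inputs, 1 binary target.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 4))
T = (X[:, :1] + X[:, 1:2] > 0).astype(float)

W, b, beta = train_elm(X, T, n_hidden=30)
Y = predict_elm(X, W, b, beta)
print(beta.shape, Y.shape)
```

Only the output weights are fitted, which is why training reduces to a single linear least-squares solve.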


This paper uses the HP-ELM toolbox, which supports multi-class, 
weighted multi-class and multi-label classification [19]. Section 4 
describes how HP-ELM is adapted to the two malware scenarios. 


4 Dataset preparation 


In this study, the HP-ELM method is evaluated on two scenarios. 
The first scenario uses the Mobile Malware dataset and the second 
scenario uses the CTU-13 dataset. We describe the method of 
data collection for these two scenarios next. 


4.1 Mobile Malware dataset (Scenario 1): 


This scenario consists of two types of applications: benign 
and malware. Twenty normal apk files were downloaded from Google 
Play. These benign files were installed on an Android-based 
operating system, Jelly Bean version 4.3, which runs on mobile 
devices. 


After installation, the network traffic of the running apps was 
captured in a real-time network environment in order to authenticate 
the behavior of the apps. For malware, the experiment utilizes 
Malgenome [29] as the malware dataset. It contains 1260 samples from 
49 malware families, used in the previous study [21]. The 
samples include several malware types, such as botnet, root 
exploit, and Trojan. The authors selected seven connection-based 
features using the Information Gain (IG) [30] algorithm because of 
its practical feature-measuring ability [31]. IG shows its 
strength in accuracy enhancement, capability of generalization 
and short execution time. It determines how well the training sets 
are separated according to the target classification [32]. In 
Scenario 1, a higher gain ratio indicates a feature's relevance in a 
classification model for a machine learning classifier. The 
features are maximum frame, frame STD, count ACK, 
minimum destination port, average frame, and average source port 
[21]. Figure 2 shows the methodology of collection of the mobile 
malware dataset in Scenario 1. 
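

The Information Gain criterion mentioned above can be sketched as follows. This is an illustrative, textbook-style computation under the assumption that a feature has already been discretized into a small number of values; the function names and the toy data are hypothetical and are not taken from [21] or [30].

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting
    the samples by the (discretized) feature value."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy example: one feature separates the classes, the other does not.
labels = ["malware", "malware", "benign", "benign"]
print(information_gain(["high", "high", "low", "low"], labels))
print(information_gain(["high", "low", "high", "low"], labels))
```

A feature with a larger gain separates benign and malware samples better, which is the sense in which the text ranks features by relevance.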
Figure 2: The methodology of collection of Mobile Malware data 


4.2 CTU-13 dataset (Scenario 2): 


This dataset consists of thirteen captures (called Scenarios) of 
different real botnet samples [20]. The traffic of each scenario is 
captured as pcap files using tcpdump. The pcap files are converted 
to the NetFlow file standard using the Argus software suite in two 
steps. The first step converts the pcap files to bidirectional Argus 
binary storage. The second step converts the Argus binary files to 
NetFlow. The outcome of these steps is the final NetFlow file [33]. 
The next step is to assign labels to the NetFlow data. Starting from 
the background label, the normal label is assigned to the 
traffic which matches a specific filter. Then the botnet label is 
assigned to the traffic that comes from or goes to any of the known 
infected IP addresses [20]. Figure 3 indicates the methodology of 
collection of CTU-13 and how HP-ELM classifies it. 
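
The labeling rule described above can be sketched as a small function. This is only an illustration: the flow field names (`src`, `dst`), the `is_normal` filter and the example addresses are hypothetical placeholders and do not reproduce the actual CTU-13 tooling or filters.

```python
def label_flow(flow, infected_ips, is_normal):
    """Assign a label to one flow record (a dict with 'src' and 'dst').
    Botnet takes priority, then normal, otherwise background."""
    if flow["src"] in infected_ips or flow["dst"] in infected_ips:
        return "botnet"
    if is_normal(flow):
        return "normal"
    return "background"

# Hypothetical inputs, for illustration only.
infected = {"10.0.0.5"}
trusted_hosts = {"10.0.0.9"}
normal_filter = lambda f: f["src"] in trusted_hosts

flows = [
    {"src": "10.0.0.5", "dst": "192.0.2.1"},
    {"src": "10.0.0.9", "dst": "192.0.2.2"},
    {"src": "10.0.0.7", "dst": "192.0.2.3"},
]
print([label_flow(f, infected, normal_filter) for f in flows])
```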
Figure 3: The methodology of collection of CTU-13 


In this research work, the labeled dataset is split into training 
(70%) and testing (30%) data [34] for evaluation with the HP-ELM 
algorithm. We also evaluated our methods with other splits, 
such as 80% training and 20% testing, and we 
found that the best result is reached with 70% training and 30% 
testing. Thus, we focus only on the case with 70% training and 
30% testing data. In the next section, we present the procedure 
for applying the method to the datasets. 
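

A 70%/30% split of this kind can be sketched as below. This is a minimal illustration assuming a random shuffle with a fixed seed; the paper does not state how the split was drawn, and the function name is hypothetical.

```python
import random

def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle the labeled rows and split them into training and testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

train, test = split_train_test(range(100))
print(len(train), len(test))  # 70 30
```

Changing `train_fraction` to 0.8 gives the 80%/20% variant that the text also mentions.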


5 Proposed method 


The proposed method includes a few modules. In detail, the first 
module (Reading) is used to read the malware datasets continuously 
and transfer them to feature selection modules. Then, the second 
module (Filtering) is responsible for normalizing the data and 
applying the feature selection algorithm to select the essential 
features for the proposed algorithm. After that, the third module 
(Splitting) is responsible for splitting the data into 70% training and 
30% testing. Then, the fourth module (HP-ELM) is adapted to tune 
its parameters to predict the targets. Finally, the fifth module 
(Evaluation) is responsible for predicting the actual values in the 
training and testing phase. 


5.1 Feature selection 


Before applying the HP-ELM method, we utilized the F-Score [35] 
and Fisher Score [36] in order to select the most valuable features. 
HP-ELM is then built on the resulting data. The features extracted 
from the CTU-13 dataset are {DstAddr, Sport, Proto, SrcAddr, Dport, 
Dur, State, sTos, dTos, TotBytes, SrcBytes, TotPkts} [37]. More 
details about the features of CTU-13 are given in [20]. 
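

Score-based feature ranking of this kind can be sketched as follows. The code uses one common form of the Fisher score (between-class scatter divided by within-class scatter, per feature); the exact variants used in [35] and [36] may differ, and the feature names and values below are toy data made up for illustration.

```python
def fisher_score(values, labels):
    """Fisher score of one numeric feature: weighted squared distance of
    class means from the overall mean, divided by weighted class variance."""
    overall_mean = sum(values) / len(values)
    by_class = {}
    for v, y in zip(values, labels):
        by_class.setdefault(y, []).append(v)
    between = within = 0.0
    for group in by_class.values():
        mean = sum(group) / len(group)
        variance = sum((v - mean) ** 2 for v in group) / len(group)
        between += len(group) * (mean - overall_mean) ** 2
        within += len(group) * variance
    return between / within if within > 0 else float("inf")

def top_k_features(columns, labels, k):
    """Names of the k features with the highest Fisher score."""
    scores = {name: fisher_score(vals, labels) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy columns: f1 separates the two classes well, f2 does not.
columns = {"f1": [0, 1, 10, 11], "f2": [0, 10, 1, 11]}
labels = [0, 0, 1, 1]
print(top_k_features(columns, labels, k=1))  # ['f1']
```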


5.2 HP-ELM parameter setting (activation 
functions, number of neurons, etc.) 


We applied six types of activation functions in the layer of 
HP-ELM: linear, rbf-linf, rbf-l1, rbf-l2, tanh, and sigmoid. A 
total of 2000 neurons was used in the layer of HP-ELM. We 
investigated HP-ELM with feature selection (top 3 and top 5 
features) and without feature selection (all features). We also 
applied different strategies for the activation functions in the 
layer of HP-ELM (one, two, three and four activation functions) on 
both datasets. 
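

The idea of mixing several activation functions in one hidden layer can be sketched in NumPy by building one block of neurons per activation type and concatenating the blocks. This is an illustration only and does not use the HP-ELM toolbox API; the RBF variants are omitted for brevity, so the example substitutes a linear block, and the weights are random placeholders.

```python
import numpy as np

# Activation functions available in this sketch (RBF variants omitted).
ACTIVATIONS = {
    "lin": lambda a: a,
    "sigm": lambda a: 1.0 / (1.0 + np.exp(-a)),
    "tanh": np.tanh,
}

def hidden_layer(X, groups, seed=0):
    """Build the hidden layer output for groups of (n_neurons, activation)."""
    rng = np.random.default_rng(seed)
    blocks = []
    for n_neurons, name in groups:
        W = rng.standard_normal((X.shape[1], n_neurons))  # random input weights
        b = rng.standard_normal(n_neurons)                # random biases
        blocks.append(ACTIVATIONS[name](X @ W + b))
    return np.hstack(blocks)  # one column per hidden neuron

# 2000 neurons in total, split across three activation types.
X = np.zeros((5, 6))
H = hidden_layer(X, [(400, "sigm"), (1000, "tanh"), (600, "lin")])
print(H.shape)  # (5, 2000)
```

The resulting matrix plays the role of H in the ELM equations, so the output weights are still obtained with a single least-squares solve regardless of how the neurons are divided among activation types.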


6 Performance Evaluation 


The following subsections describe in detail the pursued datasets, 
evaluation metrics, scenarios and their settings, and the results for 
the applied scenarios. 


6.1 Simulation Setup 


The methods were tested on a system with an Intel Core i7 CPU and 
8 GB of RAM. Two benchmark malware datasets, the CTU-13 dataset 
(Scenario 5) [20] and the DyHAP malware dataset [21], have been used 
to evaluate the performance of HP-ELM. 


6.2 Metrics 


This research uses the accuracy ratio of Equation 8 to evaluate the 
performance of the method. In general, the accuracy is the number 
of applications which the classifier correctly detects, divided by 
the total number of malicious and legitimate applications. The 
accuracy is between 0 and 1, i.e., $0 \le \text{Accuracy} \le 1$. 
We also have 


$$\text{Accuracy Ratio} = \frac{TP + TN}{TP + FN + TN + FP}, \qquad (8)$$


where the parameters denote the following numbers: TN is the number 
of accurately classified benign instances; TP is the number of 
malicious applications that are correctly identified; FP is the 
number of benign instances wrongly classified as malware 
applications; and FN is the number of malware instances wrongly 
classified as legitimate applications. 
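

Equation 8 translates directly into code. The following is a minimal sketch; the function name and the counts in the example are illustrative and are not results from the paper.

```python
def accuracy_ratio(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + FN + TN + FP), as in Equation 8."""
    total = tp + fn + tn + fp
    if total == 0:
        raise ValueError("at least one classified instance is required")
    return (tp + tn) / total

# Illustrative counts: 40 malware and 50 benign instances classified
# correctly, 5 false positives and 5 false negatives.
print(accuracy_ratio(tp=40, tn=50, fp=5, fn=5))  # 0.9
```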


7 Results and discussion 


We evaluated the results using various layers with/without the 
feature selection. 


7.1 Evaluate HP-ELM without applying feature 
selection in android malware 


This subsection compares the performance of HP-ELM with various 
activation functions on the two datasets without feature selection, 
i.e., with all features integrated. A total of 13 
features are available in the CTU-13 dataset, and the total number 
of input features for the Android Malware dataset is six. Figure 4 
(a, b) indicates the accuracy of HP-ELM in training (70%) and 
testing (30%) with and without feature selection with one, two and 
three activation functions for the Android malware dataset. The 
x-axis represents the activation function's name. For example, the 
order of the single activation functions with all features is 
linear, Rbf l1, Tanh, Sigmoid, Rbf l2 and Rbf linf, which are mapped 
to the tick points (0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0). 
HP-ELM reaches a high accuracy of 0.9696 in training and 0.9672 in 
testing with 2000 neurons of the Rbf linf activation function 
without applying feature selection. The results indicate that the 
accuracy of HP-ELM increased with two activation functions 
(Rbf linf (1000), Sigmoid (1000)). Finally, the accuracy of HP-ELM 
with three activation functions, Sigmoid (400), Tanh (1000) and 
Rbf linf (600), for a total of 2000 neurons, reaches 0.9679 in 
training and 0.9661 in testing. 


Figure 4 (a). Accuracy of HP-ELM without feature selection with one 
activation function for android malware dataset 


Figure 4 (b). Accuracy of HP-ELM without feature selection 
with two activation functions for android malware dataset 


7.2 Evaluate HP-ELM by applying feature 
selection (Top 3 features) on android malware 

This subsection compares the performance of HP-ELM under various
activation functions for the Android malware dataset with the three
highest-priority features, selected using the F-score selection
policy. The goal of this scenario is to analyze the accuracy of
the HP-ELM method under different activation functions using
various selected features. As shown in Figure 5 (a) and (b), with
the top 3 features and one activation function, Rbf_l1(2000),
HP-ELM reaches an accuracy of 0.9056 in training and 0.9017 in
testing. On the other hand, a combination of two activation
functions, such as Tanh(1000) and Rbf_l1(1000), gives a better
accuracy of 0.9018 in testing. Finally, the highest accuracy
reached with the three activation functions Sigmoid(400),
Tanh(1000) and Rbf_linf(600) is 0.8998 in testing. The
two-activation-function configuration can therefore be considered
the best strategy. Figure 5 (a) and (b) show the accuracy of HP-ELM
in the Android malware scenario.

Figure 5 (a). Accuracy of HP-ELM with top 3 features and two
activation functions for Android malware dataset

Figure 5 (b). Accuracy of HP-ELM with top 3 features with three
activation functions for Android malware dataset
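The exact F-score formula is not restated in this section. As a hedged sketch, the following Python code ranks features with the commonly used two-class F-score (between-class separation of the feature means divided by the sum of the within-class sample variances) and keeps the top k. The function names and the toy data are assumptions made for illustration only.

```python
from statistics import mean, variance


def f_score(pos: list[float], neg: list[float]) -> float:
    """Two-class F-score of one feature (assumed definition, see lead-in).

    Each class needs at least two samples for the sample variance.
    """
    overall = mean(pos + neg)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)
    if within == 0:
        return 0.0 if between == 0 else float("inf")
    return between / within


def top_k_features(rows: list[list[float]], labels: list[int], k: int) -> list[int]:
    """Return the indices of the k features with the highest F-score."""
    scores = []
    for j in range(len(rows[0])):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append((f_score(pos, neg), j))
    # Highest score first; ties are broken by the feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


# Toy data: label 1 = malware, label 0 = benign (hypothetical values).
rows = [[1.0, 5.0, 0.2], [1.2, 4.0, 0.9], [3.0, 5.5, 0.1], [3.2, 4.5, 0.8]]
labels = [0, 0, 1, 1]
selected = top_k_features(rows, labels, k=2)
print(selected)
```

In the toy data, the first feature separates the two classes most clearly, so it is ranked first.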


7.3 Evaluate HP-ELM with applying feature 
selection (Top 5 features) on android malware 


This subsection compares the performance of HP-ELM under various
activation functions in the same layer for the Android malware
dataset with the five highest-priority features, selected using the
F-score selection policy. The goal of this scenario is to analyze
the accuracy of the HP-ELM method under different activation
functions using various selected features. As shown in Figure 6 (a)
and (b), we tested the model with the top 5 features. With one
activation function, Rbf_linf(2000), HP-ELM reaches an accuracy of
0.9675 in training and 0.9656 in testing. However, the testing
accuracy of Rbf_l1(2000), which is 0.9702, is higher than that of
Rbf_linf(2000). The combination of two activation functions
Rbf_linf(1000) and Sigmoid(1000) gives an accuracy of 0.9673 in
training and 0.9654 in testing, but the combination of Rbf_l1(1000)
and Sigmoid(1000) reaches a higher accuracy of 0.9689 in the testing
phase. The combination of three activation functions, Sigmoid(400),
Tanh(1000) and Rbf_linf(600), gives an accuracy of 0.9668 in
training and 0.9661 in testing, which is not very high in comparison
with the cases of one and two activation functions.

Figure 6 (a). Accuracy of HP-ELM with top 5 features and one
activation function for Android malware dataset

Figure 6 (b). Accuracy of HP-ELM with top 5 features and two
activation functions for Android malware dataset


Figure 6 confirms that when the number of activation functions
increases, the accuracy also increases, and that its value remains
constant beyond 1000 neurons. With two activation functions, the
highest accuracy is reached in the testing phase.


7.4 Evaluate HP-ELM without applying feature 
selection on CTU-13 


HP-ELM was applied to CTU-13 in training and testing. For
example, one activation function (Rbf_linf) with 2000 neurons
reaches a high classification accuracy of 0.9625 in testing. On
the other hand, the highest accuracy with two activation functions,
rbf_l1(1000) and rbf_linf(1000), is 0.9622 in the testing phase.
Increasing the number of activation functions to three does not
yield as high an accuracy as one or two activation functions: the
highest accuracy with three activation functions is 0.9616, which is
lower than the 0.9625 obtained with one activation function and the
0.9622 obtained with two activation functions. According to Figures
7 (a), 7 (b) and 7 (c) for the CTU-13 dataset, more activation
functions add complexity, while the accuracy does not increase.
Moreover, once the best number of neurons is reached, adding
activation functions does not improve the accuracy in comparison
with the case of one activation function in the layer of HP-ELM.

Figure 7 (a). Accuracy of HP-ELM without feature selection and one
activation function for CTU-13

[Plot for Figure 7 (b): "Train & Test Data, All Features, Two
Activation Functions"; x-axis: point tag; y-axis: accuracy; series:
train data, test data]

Figure 7 (b). Accuracy of HP-ELM without feature selection and two
activation functions for CTU-13

Figure 7 (c). Accuracy of HP-ELM without feature selection and three
activation functions for CTU-13

Figure 7 (d). Accuracy of HP-ELM w


7.5 Evaluate HP-ELM with applying feature
selection on CTU-13 (Top 3)


In this section, the feature selection strategy is applied to
CTU-13. Three high-priority features are selected based on the
F-score algorithm, and HP-ELM is then evaluated on the selected
features. As in previous sections, various activation functions with
a total of 2000 neurons are applied to CTU-13. The highest accuracy
with one activation function (Rbf_linf) reaches 0.9592 in the
testing phase. Figure 8 (a, b, c) shows the accuracy of HP-ELM with
the top 3 features and different activation functions.

Figure 8 (a). Accuracy of HP-ELM with feature selection and one
activation function for CTU-13

Figure 8 (b). Accuracy of HP-ELM with feature selection and two
activation functions for CTU-13

Figure 8 (c). Accuracy of HP-ELM with feature selection and three
activation functions for CTU-13

Figure 8 (d). Accuracy of HP-ELM with feature selection and four
activation functions for CTU-13


7.6 Evaluate HP-ELM with applying feature 
selection on CTU-13 (Top 5) 


In this scenario, five high-priority features are selected based on
the F-score algorithm. With one activation function and a total of
2000 neurons, a high accuracy of 0.9589 is reached by
Rbf_linf(2000) in the testing phase, as shown in Figure 9 (a)-(d).

Second, the highest accuracy with two activation functions,
rbf_linf(1000) and linear(1000), is 0.9590 in the testing phase.
Third, the highest accuracy with three activation functions,
rbf_l1(666), rbf_l2(667) and rbf_linf(667), is 0.9588 in the testing
phase. When the number of activation functions is increased to four,
the same testing accuracy of 0.9585 is obtained for three
configurations: 1) Tanh(500), sigm(500), rbf_l1(500),
rbf_linf(500); 2) sigm(500), rbf_l1(500), rbf_l2(500), linear(500);
and 3) rbf_l1(500), rbf_l2(500), rbf_linf(500), linear(500).
However, the training accuracy of Tanh(500), sigm(500),
rbf_l1(500), rbf_linf(500) is better than that of the other two
configurations.

Figure 9 (a). Accuracy of HP-ELM with feature selection and one
activation function for CTU-13

Figure 9 (b). Accuracy of HP-ELM with feature selection and two
activation functions for CTU-13

Figure 9 (c). Accuracy of HP-ELM with feature selection and three
activation functions for CTU-13

Figure 9 (d). Accuracy of HP-ELM with feature selection and four
activation functions for CTU-13


7.7 Comparison of results 

For the evaluation of a proposed model, the key point is to show
that the HP-ELM model produces the smallest error on the testing
datasets or the highest classification accuracy. As Table 1 shows,
the accuracy values for the malware dataset without feature
selection are rather high with the two activation functions
Rbf_linf(1000) and Sigmoid(1000). From Table 2, with feature
selection, it is apparent that the accuracy is rather high with the
two activation functions Rbf_l1(1000) and Sigmoid(1000) compared
with the other configurations. Tables 3 and 4 indicate that the
highest accuracy of HP-ELM on CTU-13 without feature selection is
0.9625 in the testing phase, obtained with one activation function,
Rbf_linf(2000). On the other hand, the best accuracy of HP-ELM with
feature selection for the top 3 features is 0.9592, obtained with
Rbf_linf(2000), and the highest testing accuracy for the top 5
features is 0.9590, obtained with rbf_linf(1000) and linear(1000).


Table 1: Accuracy of HP-ELM without feature selection in
malware dataset

Dataset  #Activation function   Without feature selection    Train   Test
                                (all features)
Malware  1 activation function  Rbf_linf(2000)               0.9696  0.9672
Malware  2 activation function  Rbf_linf(1000),              0.9689  0.9673
                                Sigmoid(1000)
Malware  3 activation function  Sigmoid(400), Tanh(1000),    0.9679  0.9661
                                Rbf_linf(600)


Table 2: Accuracy of HP-ELM with feature selection in
malware dataset

With feature selection

Dataset  #Act.  Act-func (Top 3)   Train   Test    Act-func (Top 5)   Train   Test
Malware  1      Rbf_l1(2000)       0.9056  0.9017  Rbf_linf(2000)     0.9675  0.9656
Malware  2      Rbf_l1(1000),      --      0.9018  Rbf_l1(1000),      --      0.9689
                Sigmoid(1000)                      Sigmoid(1000)
Malware  3      Sigmoid(400),      --      0.8998  Sigmoid(400),      0.9668  0.9661
                Tanh(1000),                        Tanh(1000),
                Rbf_linf(600)                      Rbf_linf(600)

Tables 1 and 2 summarize the accuracy obtained by HP-ELM on the
malware dataset in two experiments, with and without feature
selection, focusing specifically on the number of activation
functions. In the feature selection experiment, the total number of
activation functions remained the same, but the classification
accuracy differed significantly as the number of selected features
increased. For instance, the testing accuracy (0.9689) with the top
5 features is higher than the classification accuracy (0.9018) with
the top 3 features for the same activation functions and number of
neurons. As shown in Table 2, in the malware experiments with three
activation functions and a total of 2000 neurons, the accuracy with
the top 5 features was higher. In general, increasing the number of
activation functions without a feature selection strategy gradually
lowers the accuracy. On the other hand, selecting more features
and activation functions increases the classification accuracy
moderately.


Table 3: Accuracy of HP-ELM without feature selection in
CTU-13 dataset

Dataset  #Activation function   Without feature selection     Train   Test
CTU-13   1 activation function  Rbf_linf(2000)                0.965   0.9625
CTU-13   2 activation function  rbf_l1(1000), rbf_linf(1000)  0.9642  0.9622
CTU-13   3 activation function  Tanh(666), sigm(667),         0.9642  0.9616
                                rbf_linf(667)
CTU-13   4 activation function  sigm(500), rbf_l1(500),       0.9633  0.9611
                                rbf_l2(500), rbf_linf(500)


Table 4: Accuracy of HP-ELM with feature selection in CTU-13
dataset

With feature selection

Dataset  #Act.  Act-func (Top 3)   Train  Test    Act-func (Top 5)   Train  Test
CTU-13   1      rbf_linf(2000)     --     0.9592  rbf_linf(2000)     --     0.9589
CTU-13   2      rbf_l1(1000),      --     --      rbf_linf(1000),    --     0.9590
                rbf_linf(1000)                    linear(1000)
CTU-13   3      Sigm(666),         --     --      rbf_l1(666),       --     0.9588
                rbf_l1(667),                      rbf_l2(667),
                rbf_linf(667)                     rbf_linf(667)
CTU-13   4      Tanh(500),         --     --      Tanh(500),         --     0.9585
                sigm(500),                        Sigm(500),
                rbf_l1(500),                      rbf_l1(500),
                rbf_linf(500)                     rbf_linf(500)


Tables 3 and 4 give the classification accuracy of HP-ELM for two
experiments, with and without feature selection, on the CTU-13
dataset. In the feature selection experiment, HP-ELM achieved its
highest accuracy with a low number of selected features: about
0.9592 for the top 3 features with one activation function.


8 Conclusion and future discussion 


In this paper, we exploit HP-ELM to improve prediction stability
for the training of SLFNs. The presented model optimizes the
input weights and hidden layers and provides more consistent
performance in comparison to the other training models. The actual
performance of the HP-ELM classifier was tested on two real-world
datasets, with various activation functions, neurons, layers, and
different feature selections. The simulation results show that
HP-ELM is a fast training method that reduces the root mean square
error to near zero, brings the accuracy ratio close to one,
achieves an optimized feature combination, provides excellent
generalization performance on an SLFN, and establishes a network
intrusion detection model with the best overall performance.
Despite the promising results obtained with the proposed model in
this case study, the model can still be improved, and further
studies, which take more variables into account, will need to be
undertaken.


ABSTRACT


Pattern mining aims to discover from data the implicit, previously
unknown and potentially useful information and knowledge in the
form of patterns. Over the past 20 years, numerous pattern mining
algorithms have been proposed. They focused on algorithmic
efficiency, functionalities, and other aspects. These algorithms
have been applied to various real-life applications running in
serial, parallel, and/or high-performance computing environments.
In this paper, we review many existing pattern mining algorithms
and suggest some pattern mining algorithms (especially, hybrid
vertical frequent pattern mining running in serial, parallel,
high-performance computing, and/or edge/fog environments) to
discover knowledge from datasets.


1 Introduction


Data mining [1,2] aims to discover implicit, previously unknown
and potentially useful information and knowledge from data. As a
popular data mining task, pattern mining discovers knowledge in
the form of patterns (e.g., discriminative patterns [3], 'following'
patterns [4], frequent patterns/itemsets [5,6], frequent subgraphs
[7], periodic patterns [8], sequential patterns [9]). For instance,
frequent pattern mining (aka frequent itemset mining) finds
frequently occurring patterns such as sets of frequently
co-occurring items or events. Frequent pattern mining can serve
as either a stand-alone mining task or a pre-processing step for
other data mining tasks (e.g., association rule mining, sequential
mining, associative classification). For instance, frequent patterns
mined from shopper market transactions reveal baskets of popular
merchandise items purchased by customers. Similarly, frequent
patterns mined from web logs reveal collections of frequently
visited webpages. In addition, the mined frequent patterns can
serve as building blocks for other data mining tasks like
association rule mining [10], which aims to discover association
rules of the form $A \rightarrow C$ for revealing association
relationships between frequent patterns A and C. Similarly,
sequential mining aims to find temporally frequent sequences of
frequent patterns. Moreover, associative classification aims to
discover classification rules of the form $A \rightarrow L$ for
classifying frequent pattern A with a class label L.

Over the past 20 years, frequent pattern mining has drawn the
attention of many researchers, who developed numerous frequent
pattern mining algorithms. Many of the early frequent pattern
mining algorithms were serial algorithms, such as the Apriori-based
ones [11], which depend on a generate-and-test paradigm to mine
frequent patterns from transaction datasets by first generating
candidates and then checking their actual frequency (i.e.,
occurrences) against the dataset. To improve algorithmic
efficiency, other serial frequent pattern mining algorithms have
been developed, such as tree-based frequent pattern mining
algorithms (e.g., FP-growth [12]), hyperlinked array based
frequent pattern mining algorithms (e.g., H-mine [13]), and
bitwise-based frequent pattern mining algorithms (e.g.,
B-mine [14]). Besides the transaction-centric algorithms that mine
the datasets "horizontally", there are also item-centric
algorithms, such as Eclat [15], dEclat [16], and VIPER [17], that
mine the datasets "vertically". Depending on factors like data
density, one of these item-centric frequent pattern mining
algorithms can run faster than another.

Besides the aforementioned serial algorithms for the discovery
of frequent patterns, there are also distributed and parallel
frequent pattern mining algorithms [18-21]. They use multiple
processors that either (a) have access to a shared memory with which
information is exchanged among processors via Open
Multi-Processing (OpenMP) or (b) have their own private memory
(i.e., distributed memory) with which information is exchanged
(i.e., messages are passed) among processors via Message Passing
Interface (MPI). These processors are usually in a computer cluster
(which is a group of distributed or parallel computers that are
interconnected through high-speed local area networks (LAN) and
work together as a single computing unit to mine the data) or a
computer grid (which is a loosely coupled group of coordinated
heterogeneous networked computers, in which each computer
may perform a different mining process) [22].

Advancements in technology enable easy generation and 
collection of huge volumes of valuable data (which may be of 
different veracity level) at a high velocity from a wide variety of 
data sources such as Internet of Things (IoT) devices. Data science 
is in demand for (a) discovering knowledge from these big data, 
and for (b) visualizing, validating and interpreting these big data 
and the mined results (i.e., knowledge discovered from these big 
data). In recent years, frequent pattern mining algorithms have
used the MapReduce programming model [23], implemented in
the Apache Hadoop or Apache Spark framework [24], to mine
frequent patterns in a cloud computing environment [25]. Here, as
implied by its name, MapReduce uses (a) the map function to
transform each value in an input list to a mapped value in an output
list and (b) the reduce function to combine values in an input list to
a single reduced value as an output. By doing so, the data miner only
needs to focus on specifying the map and reduce functions, without
worrying about implementation details on how to handle machine 
failures, manage inter-machine communication, partition the input 
data, and/or schedule and execute the program across multiple 
machines. Apache Hadoop relies on the Hadoop Distributed File 
System (HDFS) to store data on commodity machines and provide 
high aggregate bandwidth across the computer cluster. In contrast, 
Apache Spark relies on a read-only multiset of data items—called 
resilient distributed dataset (RDD)—distributed over a cluster of 
machines and maintained in a fault-tolerant fashion. Cloud 
computing utilizes a network of remote servers that are hosted on 
the Internet—such as cloud data center or cloud computing services 
like Amazon Web Services (AWS)—for the storage, management, 
processing, and analysis of data. Different types of clouds (e.g., 
public, private, hybrid clouds) involve groups of interconnected 
and virtualized computers to provide on-demand services such as 
mining-as-a-service (MaaS) [26]. While efficient, many of these 
high performance computing (HPC) based frequent pattern mining 
algorithms transmit data to clouds for mining. For example, 
MREclat [27] and Dist-Eclat [28] are MapReduce versions of Eclat 
and dEclat, respectively, whereas BigFIM [28] is a MapReduce 
version of a hybrid of the Apriori and dEclat algorithms. 
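
The map and reduce functions described above can be illustrated with a small single-machine sketch in Python. It only simulates the programming model with the built-in `map` and `functools.reduce`; it does not use Hadoop or Spark, and the function names and the toy transactions are assumptions made for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_transaction(transaction):
    """Map step: emit one (item, 1) pair per distinct item."""
    return [(item, 1) for item in sorted(set(transaction))]


def reduce_counts(counts):
    """Reduce step: combine a list of partial counts into one value."""
    return reduce(lambda a, b: a + b, counts, 0)


def item_frequencies(transactions):
    grouped = defaultdict(list)
    # Map phase, followed by a shuffle that groups values by key.
    for pairs in map(map_transaction, transactions):
        for item, one in pairs:
            grouped[item].append(one)
    # Reduce phase: one reduced value per key.
    return {item: reduce_counts(ones) for item, ones in grouped.items()}


transactions = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "c", "d"]]
frequencies = item_frequencies(transactions)
print(frequencies)
```

In a real deployment, the framework rather than the miner would take care of partitioning the transactions, scheduling the mappers and reducers, and handling failures.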

To reduce the amount of data transmission and thus to speed up
the mining process, frequent pattern mining can be performed by
using edge computing (aka fog computing) [29-31]. The idea is to
store, manage, process and analyze data in the fog or on the edge
devices. In other words, the computation is performed on local IoT
devices or the networking services that are located in between the
local IoT devices and the cloud data centers. In this paper, we
focus on vertical mining (i.e., item-centric frequent pattern
mining). Recall that the runtimes of item-centric frequent pattern
mining algorithms vary depending on factors like data density. Our
first key contribution of this paper is hybrid algorithms, which
switch among the Eclat [15], dEclat [16] and VIPER [17] algorithms
in order to take the benefits of the three worlds. Our second key
contribution of this paper is edge/fog computing based vertical
mining algorithms, which reduce the amount of data transmission
and thus speed up the mining process.

The remainder of this paper is organized as follows. The next
section gives background and related works. Section 3 presents our 
algorithms. Section 4 briefly discusses the evaluation results of our 
feasibility studies. Finally, conclusions are drawn in Section 5. 


2 Background and Related Works 


The classical Apriori algorithm [11] applies a generate-and-test 
paradigm in mining frequent patterns in a level-wise bottom-up 
fashion. Specifically, the algorithm first generates candidate 
patterns of cardinality k (i.e., candidate k-itemsets) and then tests
if each of them is frequent (i.e., tests if its frequency meets or
exceeds the user-specified minimum frequency threshold). Based on
these frequent patterns of cardinality k (i.e., frequent k-itemsets),
the algorithm then generates candidate patterns of cardinality k+1
(i.e., candidate (k+1)-itemsets). This process is applied repeatedly
to discover frequent patterns of all cardinalities. A disadvantage of
the Apriori algorithm is that it requires $k_{max}$ scans of the
database to discover all frequent patterns (where $k_{max}$ is the
maximum cardinality of discovered patterns). During each database
scan, the algorithm mines frequent patterns in a transaction-centric
fashion: it finds which k-itemsets are supported by (or contained
in) each transaction.
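
The level-wise generate-and-test paradigm can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of [11]; the function name, the absolute threshold `minsup` and the toy transactions are assumptions.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {frequent itemset: frequency} using level-wise search."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]  # candidate 1-itemsets
    frequent = {}
    k = 1
    while candidates:
        # Test step: one scan of the database per level k.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(level)
        # Generate step: join frequent k-itemsets into (k+1)-candidates
        # and prune those having an infrequent k-subset.
        k += 1
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


transactions = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "c", "d"]]
result = apriori(transactions, minsup=2)
print(result)
```

The `while` loop makes the cost visible: each iteration corresponds to one full scan of the transaction list.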

Like the classical Apriori algorithm, the Eclat algorithm [15] 
also uses a level-wise bottom-up paradigm to mine frequent 
patterns. However, it does so in an item-centric fashion: it counts
the number of transactions supporting or containing the
patterns. To elaborate, with Eclat, the database is treated as a
collection of item lists. Each list for an item x keeps the IDs of
transactions containing x. The length of the list for x gives the
frequency of 1-itemset {x}. By taking the intersection of the lists
for two frequent itemsets $\alpha$ and $\beta$, we get the IDs of
transactions containing $(\alpha \cup \beta)$. Again, the length of
the resulting (intersected) list gives the frequency of the pattern
$(\alpha \cup \beta)$. Eclat works well when the database is sparse.
However, when the database is dense, these item lists can be long.
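
The tidset intersection described above can be sketched as follows in Python, using built-in sets as item lists. The function name and the toy transactions are assumptions made for illustration.

```python
def vertical_tidsets(transactions):
    """Convert a horizontal database into item -> set of transaction IDs."""
    tidsets = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


transactions = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "c", "d"]]
tidsets = vertical_tidsets(transactions)

# Frequency of a 1-itemset is the length of its list.
frequency_a = len(tidsets["a"])

# Intersecting two lists gives the IDs of transactions containing both.
tidset_ac = tidsets["a"] & tidsets["c"]
frequency_ac = len(tidset_ac)
print(tidset_ac, frequency_ac)
```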

As an extension to Eclat, the dEclat algorithm [16] also uses a 
level-wise bottom-up paradigm. Unlike Eclat (which keeps sets of
IDs of transactions containing itemsets, i.e., tidsets), dEclat
uses diffsets, where a diffset is the set difference between the
tidsets of two related itemsets. Specifically, the diffset of a
k-itemset $\alpha = \gamma \cup \{a\}$, where $\gamma$ is its
(k-1)-prefix, is defined as the difference between the tidset of
$\gamma$ and the tidset of $\alpha$. To start the mining process,
dEclat computes the diffset of 1-itemset {x} by taking the
complement of the tidset of {x}, i.e.,

$$\text{diffset}(\{x\}) = \text{tidset}(TDB) - \text{tidset}(\{x\}) = \{ t_j \mid x \notin t_j \in TDB \},$$

where $t_j$ represents the j-th transaction in the transaction
database TDB. The diffset of {x} then captures those transactions
that do not contain {x}. For a transaction database TDB containing
n transactions, the frequency of 1-itemset {x} can then be computed
by

$$n - |\text{diffset}(\{x\})|,$$

where $|\text{diffset}(\{x\})|$ represents the length of
diffset({x}). Afterwards, let k-itemsets
$\alpha = \gamma \cup \{a\}$ and $\beta = \gamma \cup \{b\}$ share a
common (k-1)-prefix $\gamma$ such that
$(\alpha \cup \beta) = \gamma \cup \{a, b\}$. Then, taking the set
difference between diffset($\beta$) and diffset($\alpha$) gives the
diffset of the resulting $(\alpha \cup \beta)$. The frequency of
$(\alpha \cup \beta)$ can be computed by subtracting the length of
diffset($\alpha \cup \beta$) from the frequency of $\alpha$.
dEclat works well when the database is dense. However, when the
database is sparse, these diffsets can be long.
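
The diffset computations described above can be traced on a toy database with the following Python sketch. The dictionary names, the helper `extend` and the toy transactions are assumptions; the sketch only follows the set-difference rules stated in the text and is not the implementation of [16].

```python
transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "c", "d"}]
n = len(transactions)
all_tids = set(range(n))

# tidset of each 1-itemset {x}
tidset = {x: {tid for tid, t in enumerate(transactions) if x in t} for x in "abcd"}

# diffset({x}) = tidset(TDB) - tidset({x}); frequency({x}) = n - |diffset({x})|
diffset = {frozenset(x): all_tids - tidset[x] for x in "abcd"}
frequency = {p: n - len(d) for p, d in diffset.items()}


def extend(alpha, beta):
    """Combine two itemsets that share a common prefix.

    diffset(alpha U beta) = diffset(beta) - diffset(alpha)
    frequency(alpha U beta) = frequency(alpha) - |diffset(alpha U beta)|
    """
    union = alpha | beta
    diffset[union] = diffset[beta] - diffset[alpha]
    frequency[union] = frequency[alpha] - len(diffset[union])
    return union


a, b, c = frozenset("a"), frozenset("b"), frozenset("c")
ab = extend(a, b)      # common prefix: empty itemset
ac = extend(a, c)      # common prefix: empty itemset
abc = extend(ab, ac)   # common prefix: {a}
print(frequency[ab], frequency[ac], frequency[abc])
```

On this small and fairly dense database the diffsets stay short, for example diffset({a, c}) is empty, which is the situation in which dEclat is expected to pay off.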

Alternatively, the VIPER algorithm [17] represents the item 
lists in the form of bit vectors. Each bit in the vector for a
domain item x indicates whether the corresponding transaction
contains x (bit "1") or not (bit "0"). The number of "1" bits for x
gives the support of 1-itemset {x}. By computing the dot product
(i.e., the element-wise AND) of the vectors for two frequent
itemsets $\alpha$ and $\beta$, we get the vector indicating the
presence of transactions containing $(\alpha \cup \beta)$. Again,
the number of "1" bits of this vector gives the frequency of the
resulting pattern $(\alpha \cup \beta)$. VIPER works well when the
database is dense. However, when the database is sparse, lots of
space may be wasted because the vector contains lots of 0s.
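
The bit-vector representation can be sketched in Python by using an arbitrary-precision integer as the vector of each item, where bit j is set when transaction j contains the item. The function name and the toy transactions are assumptions; VIPER [17] additionally compresses its vectors, which this sketch does not attempt.

```python
def bit_vectors(transactions):
    """Return item -> integer whose j-th bit marks transaction j."""
    vectors = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            vectors[item] = vectors.get(item, 0) | (1 << tid)
    return vectors


transactions = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "c", "d"]]
vectors = bit_vectors(transactions)

# Support of {a}: number of 1 bits in its vector.
support_a = bin(vectors["a"]).count("1")

# Element-wise AND gives the vector of transactions containing a and c.
vector_ac = vectors["a"] & vectors["c"]
frequency_ac = bin(vector_ac).count("1")
print(bin(vector_ac), frequency_ac)
```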


3 Our Hybrid Vertical Mining Algorithms 


3.1 Our Serial Hybrid Vertical Mining Algorithm 


In Section 2, we described three key item-centric serial frequent 
pattern mining algorithms—namely, Eclat, dEclat, VIPER. 
Depending on the data density, one algorithm would perform better 
than others. In other words, there is no clear winner. As such, we 
suggest here a serial hybrid algorithm that takes the best of the three 
worlds. 

As frequent patterns are mined using a level-wise bottom-up
paradigm in all three of these algorithms, our hybrid algorithm
switches from one to another based on the density of the dataset:
(a) If the dataset is dense, our hybrid algorithm switches from
using transaction IDs to using diffsets early, i.e., switching from
Eclat to dEclat early. (b) If the dataset is sparse, our hybrid
algorithm uses transaction IDs for a longer period of mining time
before it switches to diffsets, i.e., switching from Eclat to dEclat
late. (c) If the number of transaction IDs in tidsets (for Eclat) or
diffsets (for dEclat) is higher than ~12.5% of the product of the
numbers of all transactions and domain items, our hybrid algorithm
uses bit vectors rather than transaction IDs, i.e., switching from
Eclat or dEclat to VIPER.
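
The switching rules can be sketched as a small decision function in Python. Only the ~12.5% threshold of rule (c) comes from the text; choosing between tidsets and diffsets by comparing their total sizes is an assumption made here to obtain a runnable example, as are the function and constant names.

```python
BIT_VECTOR_THRESHOLD = 0.125  # rule (c): ~12.5% of (#transactions x #items)


def choose_representation(tidsets, n_transactions):
    """Pick 'tidset', 'diffset' or 'bitvector' for the 1-itemsets."""
    n_items = len(tidsets)
    tid_ids = sum(len(t) for t in tidsets.values())
    # At the first level, every ID missing from a tidset is in the diffset.
    diff_ids = n_transactions * n_items - tid_ids
    # Assumption: prefer whichever ID-based representation is smaller.
    if tid_ids <= diff_ids:
        name, ids = "tidset", tid_ids
    else:
        name, ids = "diffset", diff_ids
    # Rule (c): too many IDs to store, so switch to bit vectors.
    if ids > BIT_VECTOR_THRESHOLD * n_transactions * n_items:
        return "bitvector"
    return name


# Hypothetical datasets: 10 items over 100 transactions.
sparse = {i: set(range(5)) for i in range(10)}
medium = {i: set(range(50)) for i in range(10)}
dense = {i: set(range(95)) for i in range(10)}
choices = [choose_representation(d, 100) for d in (sparse, medium, dense)]
print(choices)
```

Under these assumptions, sparse data stays with tidsets, very dense data moves to diffsets, and data in between is stored as bit vectors.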


3.2 Our HPC Based Hybrid Vertical Mining 
Algorithm with MapReduce 


To handle big data, we extend our serial item-centric hybrid 
algorithm to become a high performance computing (HPC) based 
item-centric hybrid algorithm. Specifically, we distribute the
original horizontal transaction-centric dataset and store the
transactions as frequent equivalence classes in different units
(i.e., transformed into a vertical item-centric dataset). With the map
function, the mappers apply hybrid vertical mining on each 
partition without the need of any additional information from other 
workers. Unlike the traditional vertical mining algorithms like 
Eclat or dEclat, our hybrid algorithm does not choose just a single 
strategy. Instead, it chooses different strategies based on the 
densities of datasets. Specifically, our hybrid algorithm first 
captures transaction IDs (i.e., tidsets), which consumes less time in 
calculating the support. Our hybrid algorithm then computes 
differences among the sets of transaction IDs (i.e., diffsets). The 
switching from one strategy to another is based on the densities of 
datasets as stated in Section 3.1. 


3.3 Our Edge/Fog Computing Based Hybrid 
Vertical Mining Algorithm by Using 
Networking Services 


When mining patterns with cloud computing, local data are 
transmitted from IoT devices to data cloud centers, in which data 
are then partitioned and redistributed among several worker nodes. 
To mine patterns from IoT devices with edge/fog computing, local 
data are transmitted from the IoT devices to their local networking 
services lying in between the IoT devices and the usual data centers. 
Afterwards, the local networking services then perform data 
aggregation and data mining in order to discover locally frequent 
patterns (i.e., patterns that are locally frequent with respect to the 
data collected from one or more IoT devices served by the local 
networking services). Here, depending on the representation of 
transaction IDs (e.g., whether data are represented by tidsets, 
diffsets or bit vectors), our algorithm applies set intersections or dot 
products of represented data to find patterns that are locally 
frequent on the data supported by the networking services. 
Frequency of these patterns can be computed by measuring the 
size/length of tidsets or diffsets or by counting the number of 1s in 
the bit vectors. Once the data representation switches, our hybrid 
algorithm uses the corresponding mining techniques (e.g. 
intersection, dot product). 

Then, our hybrid algorithm takes the union of these locally 
frequent patterns. After taking the union to form global candidate 
patterns, the algorithm calculates the frequency of these global 
candidates on the local IoT device. The algorithm then transmits 
the frequency values of these candidates to other IoT devices (or a 
master node), and sums the frequency values in order to discover 
globally frequent patterns in the network. 
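
The two phases described above (local mining, then global counting of the union of local results) can be sketched in Python as follows. The brute-force local miner, the relative threshold `min_fraction` applied both locally and globally, and all names are assumptions made for illustration; in our algorithms, the local step would be the hybrid vertical miner.

```python
from itertools import combinations
from math import ceil


def count(pattern, transactions):
    """Number of transactions containing the pattern."""
    return sum(1 for t in transactions if pattern <= t)


def locally_frequent(transactions, min_fraction, max_size=2):
    """Stand-in local miner: brute-force enumeration of small itemsets."""
    items = sorted({i for t in transactions for i in t})
    threshold = ceil(min_fraction * len(transactions))
    patterns = set()
    for k in range(1, max_size + 1):
        for combo in combinations(items, k):
            pattern = frozenset(combo)
            if count(pattern, transactions) >= threshold:
                patterns.add(pattern)
    return patterns


def globally_frequent(nodes, min_fraction):
    # Union of locally frequent patterns forms the global candidates.
    candidates = set().union(*(locally_frequent(n, min_fraction) for n in nodes))
    threshold = ceil(min_fraction * sum(len(n) for n in nodes))
    result = {}
    for pattern in candidates:
        # Only frequency values are exchanged and summed, not raw data.
        total = sum(count(pattern, n) for n in nodes)
        if total >= threshold:
            result[pattern] = total
    return result


node_1 = [frozenset("abc"), frozenset("ac")]
node_2 = [frozenset("bc"), frozenset("acd")]
patterns = globally_frequent([node_1, node_2], min_fraction=0.5)
print(patterns)
```

With a common relative threshold, any pattern that is globally frequent must be locally frequent on at least one node, so the union of local results does not miss any globally frequent pattern.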


3.4 Our Edge/Fog Computing Based Hybrid 
Vertical Mining Algorithm on Local IoT 
Devices 


When mining patterns with cloud computing, local data are 
transmitted from IoT devices to data centers, in which data are then 
partitioned and redistributed among several worker nodes. When 
mining patterns with edge/fog computing through local network 
services, local data are transmitted from IoT devices to local 
networking services. Knowing that some IoT devices are powerful
enough that more computations can be performed locally, we
explore an alternative algorithm for mining patterns
with edge/fog computing. Specifically, to mine patterns from IoT 
devices with edge/fog computing, local data are kept on individual 
IoT devices in order to discover locally frequent patterns (i.e., 
patterns that are locally frequent with respect to the data collected 
from individual IoT devices). Again, depending on the 
representation of transaction IDs (e.g., whether data are represented 
by tidsets, diffsets or bit vectors), our algorithm applies set 
intersections or dot products of represented data to find patterns that 
are locally frequent on the data supported by the networking 
services. Frequency of these patterns can be computed by 
measuring the size/length of tidsets or diffsets or by counting the 
number of 1s in the bit vectors. Once the data representation 
switches, our hybrid algorithm uses the corresponding mining 
techniques (e.g., intersection, dot product). 
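The three representations named above yield the same frequency for a pattern. The following minimal sketch illustrates this on a made-up five-transaction database; the data and the helper name `to_bits` are assumptions for illustration, not part of our implementation.

```python
# Toy vertical database: item -> set of transaction IDs (tidset).
# The data and helper names are illustrative assumptions.
tidsets = {
    "a": {1, 2, 3, 4, 5},
    "b": {1, 2, 3, 4},
    "c": {2, 4, 5},
}
N = 5  # transactions are numbered 1..5

def to_bits(tids, n):
    """Encode a tidset as an n-element 0/1 vector (position i-1 for tid i)."""
    return [1 if t in tids else 0 for t in range(1, n + 1)]

# Tidsets: the frequency of {a, b} is the size of the intersection.
t_ab = tidsets["a"] & tidsets["b"]
support_tidset = len(t_ab)

# Diffsets: d(ab) = t(a) - t(b), and support(ab) = support(a) - |d(ab)|.
d_ab = tidsets["a"] - tidsets["b"]
support_diffset = len(tidsets["a"]) - len(d_ab)

# Bit vectors: the dot product counts positions where both bits are 1.
va, vb = to_bits(tidsets["a"], N), to_bits(tidsets["b"], N)
support_bits = sum(x * y for x, y in zip(va, vb))
```

All three computations give a frequency of 4 for {a, b}; only the amount of data that must be stored and intersected differs.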

After taking the union of these locally frequent patterns to form 
global candidate patterns, we calculate the frequency of these 
global candidates on each IoT device. We then transmit their 
frequency values to other IoT devices (or a master node), and sum 
the frequency values in order to discover globally frequent patterns 
in the network. 
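The local-to-global step can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the local mining is replaced by brute-force enumeration of itemsets of up to two items, and the same relative support threshold is applied locally and globally.

```python
from collections import Counter
from itertools import combinations

def local_counts(transactions, max_len=2):
    """Count every itemset of up to max_len items in one device's data."""
    counts = Counter()
    for t in transactions:
        for k in range(1, max_len + 1):
            for itemset in combinations(sorted(t), k):
                counts[itemset] += 1
    return counts

def mine_globally(devices, minsup_ratio):
    # Step 1: each device finds its locally frequent patterns, and the
    # union of these patterns forms the global candidates.
    local = [local_counts(d) for d in devices]
    candidates = set()
    for data, counts in zip(devices, local):
        threshold = minsup_ratio * len(data)
        candidates |= {p for p, c in counts.items() if c >= threshold}
    # Step 2: each device reports its frequency for every global
    # candidate, and the reported values are summed.
    total = Counter()
    for counts in local:
        for p in candidates:
            total[p] += counts[p]
    # Step 3: keep the candidates that are frequent over all the data.
    n = sum(len(d) for d in devices)
    return {p: c for p, c in total.items() if c >= minsup_ratio * n}

devices = [
    [{"a", "b"}, {"a", "b"}, {"a"}],
    [{"b", "c"}, {"b"}, {"a", "b"}],
]
result = mine_globally(devices, 0.5)
```

Taking the union is safe because a pattern that is globally frequent at a given relative threshold must be locally frequent on at least one device at that threshold.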


4 Evaluation Results 


To evaluate our hybrid frequent pattern mining algorithm, we first 
set up a feasibility study. We then set up experiments to compare 
our algorithm with existing frequent pattern mining algorithms by 
using benchmark datasets. 

Results of our feasibility study on our serial hybrid algorithm
suggest that it switches from using tidsets to using diffsets when
the frequency of the subset is at least half of that of the superset. 
Similar results apply to our three other hybrid algorithms. 
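The switching rule can be illustrated with a short sketch. It assumes one reading of the rule: the "superset" is the tidset of a prefix pattern and the "subset" is the tidset of its extension. The function name `extend` is hypothetical.

```python
def extend(prefix_tids, item_tids):
    """Return (representation, data, support) for the pattern prefix + item."""
    tidset = prefix_tids & item_tids
    diffset = prefix_tids - item_tids
    support = len(tidset)
    # |diffset| = support(prefix) - support(extension), so the diffset is
    # no larger than the tidset exactly when support >= support(prefix) / 2.
    if 2 * support >= len(prefix_tids):
        return "diffset", diffset, support
    return "tidset", tidset, support

dense = extend({1, 2, 3, 4, 5, 6, 7, 8}, {1, 2, 3, 4, 5, 6})
sparse = extend({1, 2, 3, 4, 5, 6, 7, 8}, {1, 2, 9})
```

In the first call the extension keeps six of eight transactions, so storing the two-element diffset is cheaper; in the second call it keeps only two, so the tidset remains the smaller representation.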

Results of our feasibility study on our MapReduce based hybrid
algorithm suggest that it benefits from the switch.
Specifically, as each worker performs the vertical mining 
simultaneously, each worker may choose a different strategy based 
on the current system load. Moreover, as another benefit, our 
hybrid algorithm only needs to scan the database once in the entire 
mining process. Once vertical mining is performed by each worker, 
the results (i.e., frequent itemsets) are collected from these workers 
to the driver. 

Results of our feasibility study on mining frequent patterns with 
edge/fog computing through local networking services show that it 
is more efficient and practical than applying the hybrid algorithm with
cloud computing. A reason is that, when compared with cloud- 
computing based mining, computations for mining and analysis 
with edge/fog computing based mining are performed closer to 
end-users. 

Results of our feasibility study on mining frequent patterns with 
edge/fog computing on local IoT devices show that it is more 
efficient and practical than that with edge/fog computing through 
local networking services, which was shown to be more efficient 
and practical than that with cloud computing. A reason is that
computations for mining and analysis with edge/fog computing
through local networking services are performed closer to end-users
than those with cloud computing, and computations with edge/fog
computing on local IoT devices are performed even closer to
end-users.

Pattern mining with these two edge/fog computing approaches 
reduces latency and network bandwidth usage, enables geographic
focus, and increases reliability and security. 


5 Conclusions 


Pattern mining aims to discover from data the implicit, 
previously unknown and potentially useful information and 
knowledge in the form of patterns. Frequent pattern mining is a
popular pattern mining task in the data mining domain, and numerous
frequent pattern mining algorithms have been proposed over the past
20 years. They focus on algorithmic efficiency, functionality, and
other aspects. These algorithms have been applied to various real-life
applications running in serial, parallel, and/or high-performance
computing environments. Examples include the classical Apriori
algorithm, which is a transaction-centric level-wise algorithm that 
mines frequent patterns in a bottom-up fashion. Other examples 
include the Eclat, dEclat and VIPER algorithms, which are item- 
centric level-wise algorithms that also mine frequent patterns in a 
bottom-up fashion. 

In this paper, we reviewed many existing frequent pattern 
mining algorithms including the aforementioned serial algorithms, 
as well as other distributed and parallel algorithms, and algorithms 
running in high performance computing (HPC) environments (e.g., 
computer clusters, computer grids, data cloud centers). Our key 
contributions include our presentation of item-centric level-wise 
hybrid algorithms that mine frequent patterns in serial, in the 
MapReduce environment, using networking services, and on local 
IoT devices. These hybrid algorithms combine the benefits of three
data representations: collections of transaction IDs (as in Eclat),
differences/changes among collections of transaction IDs (as in
dEclat), and collections of bit vectors (as in VIPER). The use of
edge/fog computing in vertical frequent pattern mining allows the
mining to be performed closer to the end users, reduces the amount
of transmitted data, and saves runtime. As ongoing and future work,
we are exploring further performance enhancements.
ABSTRACT


This paper describes the development of AIDA, the Ancient Inscrip- 
tion Database and Analytics system. The AIDA system currently 
stores three types of ancient Minoan inscriptions: Linear A, Cretan 
Hieroglyph and Phaistos Disk inscriptions. In addition, AIDA pro- 
vides candidate syllabic values and translations of Minoan words 
and inscriptions into English. The AIDA system allows the users 
to change these candidate phonetic assignments to the Linear A, 
Cretan Hieroglyph and Phaistos symbols. Hence the AIDA system
provides scholars not only with a convenient online resource for
browsing Minoan inscriptions but also with an analysis tool for
exploring various phonetic assignments and their implications.
Such explorations can aid in the decipherment of Minoan
inscriptions.
1 INTRODUCTION


The ancient Bronze Age Minoan culture flourished on the island of 
Crete and some other islands and coastal areas of the Aegean Sea 
between about 3000 and 1500 BCE [10]. The Minoan language, a 
Pre-Greek, non-Indo-European language, survives only in ancient
inscriptions in three different types of scripts, namely the Linear
A script (about 1500 inscriptions), the Cretan Hieroglyphic script
(about 350 inscriptions), and the Phaistos Disk, a unique
inscription in which each symbol was printed with a seal [11]. There
are no widely accepted decipherments of these three types of Minoan
inscriptions, although there are many proposals. One problem with
decipherment attempts is that there are too few long inscriptions
in the three scripts. About 1200 Linear A inscriptions contain only
one or two symbols. It would be highly
beneficial for a decipherment effort to bring together all three types 
of inscriptions into a common format. Since Linear A inscriptions 
are the most common, this would mean in practice the translation 
of the Cretan Hieroglyph and the Phaistos Disk inscriptions into 
Linear A. That is one of the goals of our Minoan database system. 
The translation to Linear A is based on two functions: first,
a mapping from the Cretan Hieroglyph symbols to the Linear A
symbols, and second, a mapping from the Phaistos symbols to the
Cretan Hieroglyph symbols.
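Composing the two mappings gives a translation from Phaistos Disk sequences to Linear A. The following minimal sketch shows the idea; the symbol numbers are made up for illustration and are not taken from the published mapping tables, and the function names are hypothetical.

```python
# Made-up symbol numbers, chosen only for illustration; the real
# mappings are the ones published in the cited references.
PD_TO_CH = {"1": "10", "2": "20", "3": "30"}
CH_TO_LA = {"10": "100", "20": "200", "30": "300"}

def translate(sequence, mapping):
    """Map each symbol of a hyphen-separated sequence; '?' if unmapped."""
    return "-".join(mapping.get(s, "?") for s in sequence.split("-"))

def pd_to_la(sequence):
    """Phaistos Disk -> Linear A, by composing the two mappings."""
    return translate(translate(sequence, PD_TO_CH), CH_TO_LA)

full = pd_to_la("1-2-3")
partial = pd_to_la("1-9")
```

Symbols without a known counterpart are kept as `?` so that a partially translated sequence still has the right length.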

We present the AIDA system, short for Ancient Inscription Data- 
base and Analytics system, which brings all three types of Minoan 
inscriptions into the same Linear A format and provides a powerful 
search capability. The acronym AIDA is familiar from Verdi's
opera of the same name, in which the Ethiopian princess is called
Aida. That name is said to derive from Aita, an ancient Egyptian or
other African woman’s name. It may also be cognate with Finnish
dita, which means "mother" in English. In any case, one of the major 
goals of the AIDA system is to find possible cognates of the Minoan 
words. 

In AIDA, one can enter any Linear A sequence, and the database
system will return all the words and inscriptions
that contain that sequence including the Cretan Hieroglyph inscrip- 
tions and Phaistos Disk blocks whose translations into Linear A 
contain the search sequence. Similarly, one can search a Cretan 
Hieroglyph sequence and bring up all three types of inscriptions 
that contain the equivalent signs. In addition, our system provides 
the English meaning of a set of words from the lexicon in [16] and 
translations of texts from [14-17]. 

The rest of this paper is organized as follows. Section 2 describes 
all the data sources for our research. Section 3 shows the entity 
relationship diagram of our database and outlines the main imple- 
mentation features. Section 4 describes the AIDA system's user 
interface and some queries. Section 5 outlines the data analytics 
that the AIDA system is planned to perform. Section 6 discusses re- 
lated work. Finally, Section 7 gives some conclusions and directions 
for further research. 


2 DATA SOURCES 

For the Cretan Hieroglyphic inscriptions we used the book Cor- 
pus Hieroglyphicarum Inscriptionum Cretae, abbreviated CHIC, by 
Olivier et al. [12]. For the Linear A inscriptions we used Godart and 
Olivier’s book Recueil des inscriptions en Linéaire A [8], which is
commonly abbreviated GORILA by the first letters of the authors 
and the title. For the Phaistos Disk, we used Evans [7]. These three 
reference books introduced, respectively, a special numbering of 
the Cretan Hieroglyph, Linear A and Phaistos Disk symbols. The 
CHIC book also gave a numbering of the Cretan Hieroglyphic
inscriptions. Evans [7] called the two sides of the Phaistos Disk
sides A and B, and numbered the blocks on side A from A1 on the
inside to A30 on the outside and those on side B from B1 on the
inside to B31 on the outside.

Cretan Hieroglyphs and Linear A are just two of about ten dif- 
ferent scripts that belong to the Cretan Script Family, whose devel- 
opment was studied using bioinformatics phylogenetic algorithms 
in Revesz [13]. The discovery of the Cretan Script Family played an 
essential role in the decipherment of the Phaistos Disk [15], Cretan 
Hieroglyphic inscriptions [17] and Linear A [16]. All these decipher- 
ments were based on one-to-one mappings between pairs of scripts 
within the Cretan Script Family. When a script with known pho- 
netic values is mapped to a script with unknown phonetic values, 
then the phonetic values of the former script also can be mapped, 
at least tentatively, to the symbols of the latter script. 

Revesz also gave one-to-one mappings from the Phaistos Disk
symbols to the Cretan Hieroglyphs [13] and from the Cretan
Hieroglyphs to the Linear A symbols [17]. These mappings enable the
transliteration from any of the three types of Minoan inscriptions 
into the other two types. 


3 DATABASE DESIGN AND 
IMPLEMENTATION 


Our entity-relationship diagram is shown in Figure 1. The diagram
contains one relation each for the Phaistos Disk symbols
(PD-Symbol), the Cretan Hieroglyph symbols (CH-Symbol)
and the Linear A symbols (LA-Symbol). These three sets of symbols 
are indexed, respectively, by the identification numbers given by 
Evans [7], CHIC [12], and GORILA [8]. We also have relations that 
store the Phaistos Disk block numerical sequences (PD-Block), the 
Cretan Hieroglyph number sequences (CH-Inscriptions), and the 
Linear A words (Lin-A-Lexicon). Between any type of inscriptions 
and the corresponding type of symbols, there is a many-to-many 
containment relation. Therefore, there are three containment rela- 
tions: Contains-PD, Contains-CH and Contains-LA. Finally, relation 
Lin-A-Inscriptions stores each translated Linear A inscription as a
number sequence and a meaning. There is a many-to-many relation- 
ship between the Lin-A-Lexicon relation and the Lin-A-Inscriptions 
relation. For each Lin-A-Lexicon tuple we store the Linear A word's 
number sequence as well as its meaning, which is an English word 
or phrase. We indicate one-to-one relationships by arrows and the 
number 1 on the links between the entity sets and the relationship 
set. Similarly, we also indicate by the symbols N and M on the links 
the many-to-many relationships. 
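The Linear A part of this design can be sketched as follows. This is an illustration under stated assumptions rather than the actual AIDA schema: the column names are hypothetical, and SQLite is used only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE la_symbol (symbol_id TEXT PRIMARY KEY);       -- LA-Symbol
CREATE TABLE lin_a_lexicon (                               -- Lin-A-Lexicon
    word_id INTEGER PRIMARY KEY,
    number_sequence TEXT NOT NULL,
    meaning TEXT NOT NULL
);
CREATE TABLE contains_la (                                 -- Contains-LA
    word_id INTEGER REFERENCES lin_a_lexicon(word_id),
    symbol_id TEXT REFERENCES la_symbol(symbol_id),
    PRIMARY KEY (word_id, symbol_id)
);
""")

def add_word(word_id, sequence, meaning):
    """Store a lexicon word and its many-to-many links to its symbols."""
    conn.execute("INSERT INTO lin_a_lexicon VALUES (?, ?, ?)",
                 (word_id, sequence, meaning))
    for symbol in set(sequence.split("-")):
        conn.execute("INSERT OR IGNORE INTO la_symbol VALUES (?)", (symbol,))
        conn.execute("INSERT INTO contains_la VALUES (?, ?)", (word_id, symbol))

add_word(1, "57-7-67", "star")
add_word(2, "8-27", "light")

# All lexicon words that contain Linear A symbol 67.
rows = conn.execute(
    """SELECT l.meaning
       FROM lin_a_lexicon AS l
       JOIN contains_la AS c ON c.word_id = l.word_id
       WHERE c.symbol_id = ?""",
    ("67",),
).fetchall()
```

The containment relation makes it possible to retrieve every word that uses a given symbol without scanning the number sequences themselves.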

For the implementation, we used the MySQL database system
for storing and retrieving data. We built the system interface, which
is described in more detail in Section 4.1, using Bootstrap v4.3.1,
HTML and CSS. A PHP script handles the input from the user
interface and provides output to the users. The whole
system is hosted at the following University of Nebraska-Lincoln
server: https://cse.unl.edu/~revesz/aida.php.


Figure 1: The entity-relationship diagram.


4 THE USER INTERFACE AND QUERIES 


Next we describe the AIDA system’s user interface in Section 4.1. 
After that, the following three sections present different types 
of queries. In particular, Section 4.2 presents Linear A queries, 
Section 4.3 presents Cretan Hieroglyph queries, and Section 4.4 
presents English word queries. 


4.1 The User Interface 


Figure 2 shows the user interface of the AIDA system. The top 
line of the user interface contains some clickable choices regarding 
various information options about the AIDA system, including a 
brief user’s manual that describes how to use the system. The next 
three lines of the user interface show three prompt boxes. The
user can select any of these three prompt boxes to enter a query. 
The first prompt box allows the user to enter a Linear A number 
sequence. The second prompt box allows the user to enter a Cretan 
Hieroglyph number sequence. The third prompt box allows the 
user to enter an English keyword. In case the user knows the actual 
symbol sequence but forgot the associated numbers, the bottom of 
the AIDA user interface shows a matrix of Linear A symbols. Below 
each Linear A symbol, its identification number is given based on 
the GORILA book [8]. 


4.2 Linear A Queries 


By Linear A queries we mean queries that search for the occur- 
rences of various substrings in the Minoan lexicon and the Minoan 
inscriptions stored in the AIDA system. As an example of a Linear 
A query, we use the sequence 57-7-67. Given that number sequence, 
the system returns the answer shown in Figure 3. We see that it is 
used in three different Linear A inscriptions. For these inscriptions 
the entire Linear A number sequences and the GORILA identifica- 
tion strings are returned. After the GORILA identification string 
we also list in parentheses the GORILA volume number and page 
number separated by a slash where the inscription is described. 
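How AIDA matches sequences internally is not described here, so the following is only a sketch of one reasonable approach. It compares whole symbols rather than characters, because a plain substring test would wrongly find 57-7-67 inside 157-7-67. The function name is hypothetical.

```python
def contains_sequence(inscription, query):
    """True if the symbols of `query` occur consecutively in `inscription`."""
    hay, needle = inscription.split("-"), query.split("-")
    return any(hay[i:i + len(needle)] == needle
               for i in range(len(hay) - len(needle) + 1))

hit = contains_sequence("57-7-67-4-4-39-29-27", "57-7-67")
miss = contains_sequence("157-7-67", "57-7-67")
naive = "57-7-67" in "157-7-67"  # character-level test gives a false match
```

Splitting on the hyphen first keeps symbol identifiers such as 120B intact as single units.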
[Figure 2 shows the menu bar (Linear A, Home, About, Contact), three
search boxes (search by Linear A number sequence, by Cretan Hieroglyph
number sequence, and by English word), a Submit button, and a reference
table of Linear A symbols with their GORILA numbers.]

Figure 2: The AIDA user interface.


In addition, the sequence 57-7-67 also occurs in several Linear
A lexicon words. One of the lexicon words means "star" while
other lexicon words mean "moon". It appears that in the Minoan
language the word for "moon" is expressed as either the compound
"star+queen" or the compound "star+head", that is, the moon was
viewed as the queen or the chief of the stars.

The AIDA system also returns in the last column the syllabic 
transliteration of the Linear A word for star. The syllabic values are 
based on Table 12 in [16]. Figure 3 shows that the syllabic value for 
"star" is ke-es-ki. The syllabic values of the Linear A symbols can 
be updated by the users, which would allow some experimentation. 
However, any change of syllabic value of a symbol needs to be 
carefully investigated for its implications. The AIDA system is 
designed to facilitate such an investigation because the users can 
retrieve all the words and previous translations that may contain a 
particular symbol and then see the effect of any change. 

The AIDA system also displays in the third and fourth columns
the putative cognates and the languages in which those cognates 
occur, respectively. For example, the word kiska is a Selkup word 
that also means "star" in that language. Note the phonetic similarity 
between ke-es-ki, which was likely pronounced as keski and the 
Selkup word kiska. The phonetic similarities and the same meaning 
suggest that they are cognate words. Other possible cognate words 
retrieved by the AIDA system are xus in Khanty, kons in Mansi and
kusku in Hattic, all meaning "star".


4.3 Cretan Hieroglyph Queries 


Similar to Linear A queries, a Cretan Hieroglyph query retrieves all 
the Minoan inscriptions that contain a particular Cretan Hieroglyph 
sequence or its Phaistos Disk or Linear A equivalent sequences. As
an example of a Cretan Hieroglyph query, we used the sequence 
25-04-03 as shown in Figure 4. 

The AIDA system gave an output table where the first column 
shows the equivalent Linear A sequences of two Minoan inscrip- 
tions. The first inscription is a block of the Phaistos Disk, namely 
block B3. Normally under the CHIC column we would have the 
Cretan Hieroglyphic inscription identification number from [12], 
which ranges from #1 to #331. However, there are a few inscriptions 
that can be considered Cretan Hieroglyph inscriptions, although 
they do not appear in [12]. One of these inscriptions is the
Arkalochori Axe inscription, which we added to the database as the
Cretan Hieroglyphic inscription CHIC #332. The AIDA system was able
to bring these two inscriptions in different scripts together and
show their relationship. They share a common subsequence, which in
Linear A is the number sequence 004-712-028 according to the
numbering of the Linear A symbols in [8]. The common subsequence
is likely a suffix when both inscriptions are read from left to
right. In a similar manner,
a user may find all the occurrences of other candidate prefixes
and suffixes. The prefix or suffix nature of the sequences would 
be strongly supported by their multiple occurrences at the begin- 
ning or the end of short inscriptions or the blocks within larger 
inscriptions such as the Phaistos Disk. 
Lexicon entries:

  Meaning  Linear A sequence        Cognate  Language   Syllabic value
  moon     57-7-67-648
  moon     57-7-67-57-31-31-60-13
           cf. star + queen > Kasku (Hattic)
           cf. star + head > Moon
  star     57-7-67                  kiška    Selkup     ke-es-ki
  star     57-7-67                  xus      Khanty     ke-es-ki
  star     57-7-67                  kons     Mansi      ke-es-ki
  star     57-7-67                  hügy     Hungarian  ke-es-ki
  star     57-7-67                  kušku    Hattic     ke-es-ki

Inscriptions:

  IO Za 2 (5/19)
    8-59-28-301-54-57-57-7-67-57-31-31-60-28-39-6-80-41-26-4-59-6-60-4-10-37-55-28-1
    All cave spirits: Moon rise IMP big! Cave spirit mother

  PK Za 8 (4/26)
    55-56-38-57-7-67-4-4-39-29-27-67-13-28-57-31-10-6-77-6-4-28-51
    star [and] Moon ancestor gleam. Blow-V. 3rd SG queen cloud old ancestor

  PK Za 15 (4/41)
    57-7-67-4-4-39-29-27
    Moon ancestor gleam

Figure 3: The result of querying the Linear A sequence 57-7-67.


  648-017-004-712-028
    07-23-35-06-02

  031-041-304-004-712-028-029-010-028-086-044-002-712-031-028
    27-31-50-25-04-03-66-60-03-40-55-70-04-27-03
    332

Figure 4: The result of querying the Cretan Hieroglyph sequence 25-04-03.


4.4 Word Queries 


A word query simply retrieves all the lexicon items and translated 
texts where some English language keyword appears. The English 
language keyword can be any word in the English language. If it is 
not found in the lexicon or the translations, then the AIDA system 
returns the message "not found". As an example of a word query,
we used AIDA to look up all the items that contain the word "light" 
as shown in Figure 5 and the word "moon" as shown in Figure 6. 

As Figure 5 shows, the word "light" occurs not only in the dictio- 
nary entry for "light" but also in the dictionary entry for "sunlight." 
The entry for "light" is associated with two different Linear A
number sequences, 8-27 and 8-80, which have the syllabic
transliterations fe-ne and fe-nu, respectively. These two
pronunciations may have been dialectal variants, or they may
have had slightly different connotations that we currently do not know.
However, both of these words seem cognate with other words such 
as fény in Hungarian and páju in Sami. 

The word for "sunlight" has the Linear A number sequence
302-344-28, the syllabic transliteration pj-ai-ku and the possible
cognate paike in the Estonian language, where the word also means
"sunlight". More importantly, one can see the possible development
from Sami päju to Estonian paike with a possible suffix -ke at the
end of the word.

Figure 6 shows the word query for "moon". As we saw in Section 4.2,
in the Minoan language the moon is considered either the
queen of stars or the head of stars. Therefore, we see the sequence
57-7-67, which means "star", appear in both definitions of "moon."
In addition, the word "moon" also appears in some translated Linear
A inscriptions. Finally, Figure 7 shows the word query for "star". It
overlaps with the previous queries for the above-mentioned reasons.


5 DATA ANALYTICS 


The AIDA system can do some simple data analytics. It can count 
the number of occurrences of any substring. It can also return the 
most frequent substrings of length k in the inscriptions database, 
where k is any integer greater than or equal to two. In the future we 
plan to extend these basic statistics to a more sophisticated analysis 
where the most frequent substrings are analyzed to check whether 
they occur preferentially in the beginning, the middle or the end 
of the inscriptions. This more sophisticated analysis could help 
determine whether the most frequent substrings are prefixes, word 
roots, or suffixes, and whether the root words are likely to be nouns 
Lexicon entries:

  Meaning   Linear A sequence  Cognate  Language   Syllabic value
  light     8-80               fény     Hungarian  fe-nu
  light     8-80               baeggjo  Sami       fe-nu
  light     8-80               paju     Sami       fe-nu
  light     8-27               fény     Hungarian  fe-ne
  light     8-27               baeggjo  Sami       fe-ne
  light     8-27               paju     Sami       fe-ne
  sunlight  302-344-28         paike    Estonian   pj-ai-ku
  sunlight  302-344-28         fény     Hungarian  pj-ai-ku
  sunlight  302-344-28         fehér    Hungarian  pj-ai-ku

Inscriptions:

  KN Zf 31 (4/155)
    41-41-17-363-310-1-81-73-363-16-73-47-6-60-8-54-39-4-58-45-344-344-28
    [Let the] cloud come, [the] Dan [river] flow, old Tamuz bring heat, shine sunlight

Figure 5: The result of querying the word "light".


Lexicon entries:

  moon  57-7-67-648
  moon  57-7-67-57-31-31-60-13
        cf. star + queen > Kasku (Hattic)
        cf. star + head > Moon

Inscriptions:

  IO Za 2 (5/19)
    8-59-28-301-54-57-57-7-67-57-31-31-60-28-39-6-80-41-26-4-59-6-60-4-10-37-55-28-1
    All cave spirits: Moon rise IMP big! Cave spirit mother

  PK Za 12 (4/38)
    8-59-28-301-54-57-8-7-67-4-41-60-13-8-A363-10-6-26-77-57-41-8-3-51-3-57-57-3-16
    All cave spirits, all stars [and the] shiny queen [Moon] cloud-NOUN-PREP run high!

  PK Za 8 (4/26)
    55-56-38-57-7-67-4-4-39-29-27-67-13-28-57-31-10-6-77-6-4-28-51
    star [and] Moon ancestor gleam. Blow-V. 3rd SG queen cloud old ancestor

  PK Za 15 (4/41)
    57-7-67-4-4-39-29-27
    Moon ancestor gleam

Figure 6: The result of querying the word "moon".


or verbs. The AIDA system also could help discover relationships 
among various scripts, strengthening recent work that shows that 
Near Eastern scripts have spread both to the west and to the east [4]. 
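The counting of length-k substrings and the planned positional analysis can be sketched as follows. This is an illustration under assumptions, not the AIDA implementation: the function name and the three position labels are hypothetical, and the two input sequences are the PK Za 15 and PK Za 8 sequences shown in Figure 6.

```python
from collections import Counter

def kgram_positions(inscriptions, k):
    """Count each length-k symbol substring by where it occurs."""
    stats = {}
    for text in inscriptions:
        symbols = text.split("-")
        last = len(symbols) - k  # index of the final k-gram, if any
        for i in range(last + 1):
            gram = "-".join(symbols[i:i + k])
            if i == 0:
                where = "start"  # also covers a gram spanning the whole text
            elif i == last:
                where = "end"
            else:
                where = "middle"
            stats.setdefault(gram, Counter())[where] += 1
    return stats

inscriptions = [
    "57-7-67-4-4-39-29-27",
    "55-56-38-57-7-67-4-4-39-29-27-67-13-28-57-31-10-6-77-6-4-28-51",
]
stats = kgram_positions(inscriptions, 3)
star = stats["57-7-67"]
```

A substring that occurs mostly at the start or mostly at the end across many inscriptions would be a candidate prefix or suffix, while a roughly even spread would point to a word root.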


6 RELATED WORK 


Currently, there is no other online Minoan inscription database 
system available for public use. However, there is a Linear B in- 
scription database system called the DAMOS system, which is an 
abbreviation for Database of Mycenaean at Oslo [1]. The Linear B 
script was a successor of the Linear A script [11]. Linear B was the
earliest form of Greek writing and is generally agreed to have been
deciphered correctly in 1953 by M. Ventris and J. Chadwick [2, 19].

While not a database system, J. Younger’s website at the University
of Kansas, http://www.people.ku.edu/~jyounger/LinearA/, is
a frequently consulted online resource for Linear A. It provides 
an online table of Linear A words with cross references, called 
"supports" on the website, to all the inscriptions in which the word 
occurs. Since this website is not a database system, it is not possible 
to look up in which inscriptions a word occurs by using a simple 
query. Instead a user needs to manually browse a list of Linear A 
inscriptions, which are provided on separate webpages, one for the 
Lexicon entries:

  Meaning     Linear A sequence  Cognate         Language   Syllabic value
  all stars   8-7-67-4           cf. all, star              fe-es-ki-se
  chief star  8-7-67             cf. head, star             fe-es-ki
  star        57-7-67            kiška           Selkup     ke-es-ki
  star        57-7-67            xus             Khanty     ke-es-ki
  star        57-7-67            kons            Mansi      ke-es-ki
  star        57-7-67            hügy            Hungarian  ke-es-ki
  star        57-7-67            kušku           Hattic     ke-es-ki
  star        56-38              csillag         Hungarian  za-la

Inscriptions:

  KN Zf 13 (4/153)
    8-27-24-27-7-301-39-44-24-57-59-53-28-453-23-8-57-37
    [Sun] shine-IMP and [stars] gleam-IMP down happy love-ACC every day

  PK Za 12 (4/38)
    8-59-28-301-54-57-8-7-67-4-41-60-13-8-A363-10-6-26-77-57-41-8-3-51-3-57-57-3-16
    All cave spirits, all stars [and the] shiny queen [Moon] cloud-NOUN-PREP run high!

  PK Za 11 (4/84)
    8-59-28-301-54-38-8-7-67-4-4-1-39-4-53-8-70-8-363-8-31-31-60-13-10-6-26-77-6-34-28-99-6-73-6-41-26-28-6-57-3-16
    All cave spirit-INSTR. chief star ancestor gleam down love fa ko fa j chief queen cloud-POSS-PREP rise IMP big out high!

  PK Za 8 (4/26)
    55-56-38-57-7-67-4-4-39-29-27-67-13-28-57-31-10-6-77-6-4-28-51
    star [and] Moon ancestor gleam. Blow-V. 3rd SG queen cloud old ancestor

Figure 7: The result of querying the word "star".


Haghia Triada inscriptions, another for the Knossos inscriptions, 
and so on at each separate location. 


7 CONCLUSIONS AND FUTURE WORK 


The development of the AIDA system is challenging because it
requires knowledge of important database system design principles
as well as knowledge of Minoan inscriptions and the basic
concepts of comparative linguistics. These three areas of knowledge
are uniquely brought together in our AIDA system. The AIDA system
has the potential to be a widely used resource for many scholars
in the humanities in the fields of classics, history and linguistics.
As future work, we hope to extend the system with other ancient
languages and scripts, such as Sumerian [5, 18], Elamite [6], and the
Indus Valley Script [3, 20]. As our database grows, we will also
investigate the possibility of using ElasticSearch [9] to make queries
more efficient.
ABSTRACT


Database Management Systems (DBMS) need to handle large updat- 
able datasets in on-line transaction processing (OLTP) workloads. 
Most modern DBMS provide snapshots of data in a multi-version
concurrency control (MVCC) transaction management scheme. Each
transaction operates on a snapshot of the database, which is
calculated from a set of tuple versions. This enables high
parallelism and resource-efficient append-only data placement on
secondary storage. One major issue in indexing tuple versions on
modern hardware technologies is the high write amplification of
tree indexes.
Partitioned B-Trees (PBT) [5] are based on the structure of the
ubiquitous B*-Tree [8]. They achieve a near-optimal write
amplification and beneficial sequential writes on secondary storage.
Yet they have not been implemented in an MVCC-enabled DBMS to date.
In this paper we present the implementation of PBTs in PostgreSQL
extended with SIAS. Compared to PostgreSQL’s B*-Trees, PBTs achieve
50% higher transaction throughput under TPC-C and a 30% improvement
over standard PostgreSQL with Heap-Only Tuples.


CCS CONCEPTS 


* Information systems — Data access methods. 
1 INTRODUCTION


In times of Big Data, IoT, cloud computing and social media, datasets 
are large and update-intensive. Database Management Systems
(DBMS) are predestined to manage these datasets, but are the bottle- 
neck of most data-intensive operations. Datasets have a near-linear 
growth and cannot be entirely located in main memory in most 
cases. 
Whenever a transaction modifies a tuple in a multi-version
concurrency control (MVCC) enabled DBMS such as PostgreSQL, a
new version record (tuple version or simply version) of this tuple
is produced. Existing approaches in such DBMS organize versions
as a doubly-linked list, where each version record has a
creation_timestamp and an invalidation_timestamp, which is initially
empty. Whenever a transaction TX creates a successor version
upon an update, both version records are modified: the
invalidation_timestamp of the predecessor and the creation_timestamp
of the successor are set to timestamp(TX). We consider a novel
version organization based on SIAS [3]. Every version has a
creation timestamp and a single backward reference to its predecessor.

This version model assumes that every tuple comprises a set 
of tuple versions that are available as persistent version records, 
physically stored as a chain. While processing a query under a 
transaction, only the "visible" versions should be determined and
passed on for processing. The Snapshot Isolation visibility criteria
hold, i.e. a version $t_X.v_Y$ is visible to transaction $TX_Q$ if:

(1) $\mathit{creation\_timestamp}(t_X.v_Y) = \mathit{MAX}(\mathit{creation\_timestamp}(t_X.v_{ALL})) < \mathit{timestamp}(TX_Q)$;

(2) $\mathit{transaction\_status}(\mathit{creation\_timestamp}(t_X.v_Y)) = \mathit{COMMITTED}$; and

(3) $\mathit{creation\_timestamp}(t_X.v_Y) \notin T_{concurrent}(TX_Q)$.

Hence the version visibility check is very I/O intensive.
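The visibility check over a backward-linked version chain can be sketched as follows. This is an illustration under stated assumptions, not the SIAS implementation: the class and function names are hypothetical, condition (3) is read as requiring that the creating transaction is not concurrent with the reader, and the sets of committed and concurrent transactions are passed in as plain sets of timestamps.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Version:
    value: str
    creation_ts: int                       # timestamp of the creating transaction
    predecessor: Optional["Version"] = None

def visible_version(newest, tx_ts, committed, concurrent):
    """Walk the chain from the newest version back to the first visible one."""
    v = newest
    while v is not None:
        if (v.creation_ts < tx_ts                      # created before the reader
                and v.creation_ts in committed         # creator has committed
                and v.creation_ts not in concurrent):  # creator is not concurrent
            return v
        v = v.predecessor
    return None

v1 = Version("a", 10)
v2 = Version("b", 20, v1)
v3 = Version("c", 30, v2)
all_committed = {10, 20, 30}
```

Because the chain is ordered from newest to oldest, the first version that passes all three tests is also the one with the largest creation timestamp. Each step back along the chain may touch a different page, which is why the check is I/O intensive.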

With this model, searching for one or a few data tuples with spe- 
cific search predicates in base tables is an expensive operation with 
super-linear growth. In times of Big Data, when datasets typically 
cannot be entirely located in main memory, full table scans are 
not an option. Indexes describe an additional access path to tuples 
located in base tables. The index structure of a B*-Tree [8] became 
ubiquitous in DBMS [7]. The tree-index allows accessing data in a 
key-sorted order in logarithmic time. Index record and structure 
maintenance operations in the sorted tree-structure cause a high 
write amplification (WA) to secondary storage, and this effect is
amplified by the maintenance of tuple versions. Characteristics of
novel semiconductor-based storage technologies (fast reads, asymmetry,
out-of-place updates, high parallelism and wear) are not leveraged.
Indexing tuple versions is still an open research area, considering
characteristics of modern storage hardware. Index structures need to
handle modifications of index records out-of-place to reduce
write amplification on secondary storage media.

Partitioned B-Trees (PBT) [5] are based on the ubiquitous B*-Tree
[8]. They achieve a near-optimal WA and beneficial sequential writes
by collecting modifications in a main memory partition and forcing 
related nodes to secondary storage. Already persisted data is not 
physically affected by further modifications - i.e. maintenance of 
tuple versions in the index does not amplify WA. 
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In this paper we present PBTs in PostgreSQL extended with the 
version model of SIAS. Compared to PostgreSQL’s B*-Trees, PBTs 
achieve a 50% improved transaction throughput under TPC-C and 
a 30% improvement to standard PostgreSQL with Heap-Only Tuples. 

The structure of this paper is as follows. We give an overview of 
related indexing approaches in Section 2. In Section 3 we discuss 
the conflict in design decisions of MVCC. We outline the algorithms 
of PBT in Section 4 and verify our assumptions in Section 5. 


2 RELATED WORK 


The most popular indexing approaches in DBMS are based on B*-Trees,
which can result in high write amplification (WA) on random updates
in large datasets. PostgreSQL uses Heap-Only Tuples (HOT)
as an indirection layer to reduce index management operations. Index
records reference items in the base table, which point to tuple versions
in the heap node. Corresponding tuple versions are located on the
same node and are identified by processing the version chain. If a
tuple version is garbage collected, the item is modified to
reference the next relevant version. This indirection layer reduces
index modifications, but cannot avoid WA of indexes. Furthermore,
the WA on base table nodes is increased for large datasets.

Maintaining tuple versions out-of-place enables high parallelism
and a beneficial append-only sequential write pattern to secondary
storage for base tables. Snapshot Isolation Append Storage (SIAS)
makes use of the natural append-only characteristics of tuple versions
and achieves a 30% higher throughput than
PostgreSQL's standard base table organization in an OLTP workload
[3]. Every version has a creation_timestamp and a single backward
reference to its predecessor. This version model assumes that every
tuple comprises a set of tuple versions, which are available as a
persistent chain of version records. However, SIAS does not support
the HOT indirection layer, which increases the indexing effort.


2.1 MV-IDX 


MV-IDX [4] is based on a B*-Tree and maintains a virtual identifier
for each tuple and in-memory data nodes for each version as an
indirection layer. With Snapshot Isolation Append Storage (SIAS)
[3], WA on base tables is reduced in comparison to HOT, but index
management operations can still cause in-place updates and high WA,
e.g. if an indexed attribute value is modified. Partitioned
B-Trees (PBT) handle updates to indexed attribute values in a main
memory partition in the PBT-Buffer, which keeps WA low.


2.2 Write Optimized B-Trees 


The Write Optimized B-Tree [6] aims to achieve an append-only write
pattern in a B*-Tree. It is organized like a traditional B*-Tree with
limited modifications: utilizing fence keys instead of sibling pointers
enables this structure to perform out-of-place writes of modified
nodes. Because sibling pointers are missing, cursors on scans cannot
move to a sibling directly. Instead, the sibling of a node has to be
requested via the parent node, which adds complexity to buffer management.

Write Optimized B-Trees do not solve the problem of high WA
in B*-Trees. If a node gets evicted, it is written in a log structure.
However, a node is not protected from further modifications, so
already indexed data is written many times. Partitioned B-Trees (PBT)
collect modifications of leaf nodes in a PBT-Buffer until the partition
gets evicted. Every record is written exactly once, except for garbage
collection, so WA is near optimal.


2.3 LSM-Trees


LSM-Trees [9] are optimized for high update rates and reduce WA
by collecting and pre-sorting modifications in a fixed-size main
memory component, which is evicted once a certain threshold is reached
and replaced by a new main memory component. As a result, several
components exist on secondary storage media and are frequently
merged into larger components. Pre-sorted records are migrated and
sequentially written in a log-based pattern. bLSM-Trees [10] are
based on the structure of LSM-Trees; however, there is a fixed
count of three components to reduce read amplification (RA).
Furthermore, bloom filters protect components from unnecessary 
reads for point queries. Scheduling of merge areas and insertion 
rates between components reduce steals and replacement selection 
increases the effective amount of merged records. 

The advantages of Partitioned B-Trees (PBT) are manifold. First, the
single tree-index structure leverages the logarithmic relation between
capacity and height of the tree. Index nodes are shared
and buffered across partitions, whereby RA is reduced at the same
height as larger components in LSM-Trees. Second, compression
methods, like suffix truncation, perform better in one large set
of records than in several smaller sets [2]. Third, partition sizes
are self-balancing and workload-adaptive due to the shared
PBT-Buffer. Managing component thresholds in LSM-Trees requires
deep knowledge about the workload and administrative effort. Last,
partitions of PBTs are more flexible than components in LSM-Trees.
A partition can be created to absorb bulk loads with little effect on
the concurrent workload and merged into or cropped from the tree
structure afterwards, based on the result of the transaction.
Furthermore, Cached Partitions can similarly be created from result
sets of frequently queried records to reduce RA.


3 MVCC TRANSACTION MANAGEMENT SCHEME


Multi-version concurrency control (MVCC) is the most popular 
transaction management scheme in modern DBMS. For instance, it 
is used by Oracle, MySQL-InnoDB, HyPer, SAP HANA, MongoDB- 
WiredTiger and PostgreSQL. In theory, MVCC enables high par- 
allelism, because reading transactions do not block concurrently 
writing transactions. Modifications result in a new tuple version. 
Furthermore, in snapshot isolation, modifications of writing trans- 
actions do not block concurrently reading transactions, because for 
each transaction a visible tuple version can be returned. 

DBMSs implement the MVCC transaction management scheme differently.
The fundamental design decisions are (a) Concurrency
Control Protocol, (b) Version Storage, (c) Version Ordering, (d)
Garbage Collection and (e) Index Management [11]. Since (a)
Concurrency Control Protocols deal with serialization strategies (first
updater / committer wins) and have little effect on indexing and the
resulting write and read I/O patterns to secondary storage, we focus
on and outline points (b) to (e) in the following. Afterwards, we
give a short discussion. There is a dilemma in choosing the
optimal design decisions for large datasets and the characteristics of
modern storage technologies.

Figure 1: Dilemma: Conflicts in Design Decisions


3.1 Version Storage 


Tuple versions correspond to one logical tuple. They form a linked
list, which represents a version chain. A version of a tuple
can be maintained in two different ways: logically or physically.
Logical storage means that for each modification of a logical tuple
a delta record indicates the difference to another version. These
delta records are connected and required to restore a tuple version.
Physical storage means that each tuple version is stored entirely.

In both cases, the following information is required: a content / delta
that is stored in the DBMS for a tuple, logical timestamps for
validation and invalidation, and a reference to its predecessor and / or
successor version, based on the version ordering. Modifications are
performed in-place or out-of-place.

Considering the characteristics of modern storage technologies,
out-of-place updates are preferable due to lower write amplification
(WA) to secondary storage for large datasets. Furthermore, this
behavior enables higher parallelism than in-place updates, for which
the tuple version has to be exclusively latched for modification.
This is possible with logical and physical version storage. Delta
records tend to consume less space than physical tuple versions,
but require further versions for tuple reconstruction.
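
The trade-off between both storage types can be made concrete with a
small sketch. It assumes, purely for illustration, that a tuple is a
dictionary of attributes and that a delta record holds only the changed
attributes; the function name reconstruct is hypothetical. A physical
version can be read directly, whereas a logical version has to be
rebuilt from a base version and all delta records up to it.

```python
def reconstruct(base, deltas):
    """Rebuild a tuple version by applying delta records to a full base version."""
    version = dict(base)
    for delta in deltas:
        version.update(delta)   # each delta overrides only the changed attributes
    return version


# Logical storage: one full base version plus two small delta records.
base = {"id": 7, "name": "Ann", "balance": 100}
deltas = [{"balance": 80}, {"name": "Anna"}]

# Physical storage: every version is stored entirely.
physical_versions = [
    {"id": 7, "name": "Ann", "balance": 100},
    {"id": 7, "name": "Ann", "balance": 80},
    {"id": 7, "name": "Anna", "balance": 80},
]

latest_logical = reconstruct(base, deltas)
latest_physical = physical_versions[-1]
```

Both variables hold the same tuple content, but the logical variant
stores fewer attribute values and needs two additional records to
produce it.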


3.2 Version Ordering 


A version chain of a logical tuple forms a linked list. A doubly linked
list provides knowledge of predecessor and successor; however, it
requires additional latches and reduces parallelism on modification
[11] in comparison to singly linked lists, which require a version
ordering. Discovering the tuple version visible to a transaction
snapshot requires following the version chain from an entry point
until the visible version is found. Basically, there are two different
ordering methods for singly linked lists: old-to-new (O2N) and
new-to-old (N2O). For both methods in-place as well as out-of-place
updates are possible. In case of O2N-ordering, the entry point is the
oldest tuple version in the version chain. A visibility check requires
processing all predecessors, beginning from the oldest tuple version.
Updates (insertion of tuple versions) require at least modifications of
the predecessor's invalidation timestamp and reference. N2O-ordering
means that the entry point is the most recent tuple version, which
references its predecessor. Queries in OLTP transactions can find
the visible version quickly, because the most recent tuple version
is the entry point of the version chain.

Considering the characteristics of modern storage technologies,
N2O-ordering for physical version storage results in the best WA with
an append-only characteristic for large datasets, because maintaining
the validation timestamps of recent versions is sufficient. Other
combinations require in-place updates, which diminish the benefits in
parallelism of a singly linked list and in WA.
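
The effect of the entry point on the traversal length can be sketched
as follows. The example is a simplification under stated assumptions:
a chain is reduced to a list of creation timestamps, commit status is
ignored, and the function names steps_n2o and steps_o2n are
hypothetical.

```python
def steps_n2o(ts_newest_first, snapshot_ts):
    """Walk new-to-old; the first version older than the snapshot is visible."""
    for steps, ts in enumerate(ts_newest_first, start=1):
        if ts < snapshot_ts:
            return ts, steps
    return None, len(ts_newest_first)


def steps_o2n(ts_oldest_first, snapshot_ts):
    """Walk old-to-new; stop once a version is too recent for the snapshot."""
    visible, steps = None, 0
    for ts in ts_oldest_first:
        steps += 1
        if ts >= snapshot_ts:
            break
        visible = ts
    return visible, steps


chain = [10, 20, 30, 40, 50]                     # creation timestamps, oldest first
recent_n2o = steps_n2o(chain[::-1], snapshot_ts=55)
recent_o2n = steps_o2n(chain, snapshot_ts=55)
```

For a recent snapshot, which is the common case in OLTP, the N2O walk
ends at the entry point after one step, while the O2N walk has to
process all five versions to reach the same result.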


3.3 Garbage Collection 


Tuples are modified multiple times. In MVCC, modifications result
in successor versions. Predecessors become obsolete if they are
no longer visible to any active transaction. Garbage collection (GC)
reclaims space and can improve RA, especially for O2N version
ordering. However, GC increases WA on secondary storage. Tuple-level
GC can be performed as a background vacuum or as a cooperative
cleaning process [11]. In the first case, a background thread scans
and purges obsolete versions. Cooperative cleaning detects obsolete
tuple versions while version chains are processed by regular operations.
GC operations have to minimize effects on WA and additional access
paths, e.g. if the entry point changes, indexes require adaptations.


3.4 Index Management 


The complexity of index management strongly depends on version
storage and ordering techniques. A lossy result from index scans is
not acceptable, so in theory every tuple version should be indexed.
This approach can result in massive WA and RA, e.g. in case of
B*-Trees. Most popular indexes do not support visibility checks in
MVCC. Therefore, the version chain of the base table is required to
determine the visible tuple version. As a result, at least the entry
point of the version chain is indexed. Modifications of a tuple
version's content that affect search key columns have to become
visible to a lossless index. There are two possibilities to map index
records to tuple versions in base tables. First, with physical
references, the entry point tuple version in base tables can be
directly accessed, but changes to the entry point location result in
index modifications. Second, an indirection layer with logical
references can be implemented. Here, each version of a tuple is
referenced with a unique identifier. Index records reference this
unique identifier in the indirection layer, which references the entry
point version location. This approach can reduce index modifications.

In-place updates of tuple versions in base tables reduce index
maintenance operations with physical indirection. However, as
outlined in Sections 3.1 and 3.2, an out-of-place append-only scheme
brings benefits on modern storage technologies. Indexing tuple
versions with an indirection layer can reduce index maintenance
for the preferred storage scheme, if the search key attributes of the
tuple content remain constant. However, inserting and modifying
tuples in a traditional, strictly alphanumerically sorted index
structure results in in-place updates on index nodes and high WA.
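
The difference between both mapping strategies can be sketched with two
toy classes. PhysicalIndex and LogicalIndex are hypothetical names, the
index is reduced to a dictionary, and index_writes merely counts
modifications of index records; maintaining the indirection table is an
additional cost that this sketch does not measure.

```python
class PhysicalIndex:
    """Index records point directly at the entry point location."""

    def __init__(self):
        self.entries = {}        # search key -> entry point location
        self.index_writes = 0

    def set(self, key, location):
        self.entries[key] = location
        self.index_writes += 1


class LogicalIndex:
    """Index records point at an identifier in an indirection layer."""

    def __init__(self):
        self.entries = {}        # search key -> unique identifier
        self.indirection = {}    # unique identifier -> entry point location
        self.index_writes = 0

    def insert(self, key, vid, location):
        self.entries[key] = vid
        self.indirection[vid] = location
        self.index_writes += 1

    def relocate(self, vid, location):
        self.indirection[vid] = location   # the index record stays untouched

    def lookup(self, key):
        return self.indirection[self.entries[key]]


physical, logical = PhysicalIndex(), LogicalIndex()
physical.set("k", 0)
logical.insert("k", vid=1, location=0)
for new_location in (1, 2, 3):   # three updates, each moving the entry point
    physical.set("k", new_location)
    logical.relocate(1, new_location)
```

As long as the search key stays constant, the physical variant modifies
the index four times, whereas the logical variant modifies it once and
redirects the remaining changes to the indirection layer.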


3.5 Discussion 


We outlined relevant design decisions for storing tuple versions
in the MVCC transaction management scheme. Every design brings
benefits for specific tasks and requirements. We focus on large
update-intensive datasets, which cannot be entirely located in main
memory. For this setting, we introduced the dilemma between the
different designs.

Modifications are preferably stored as physical tuple versions
in base tables, due to tuple reconstruction costs. Out-of-place updates
reduce WA. This can be achieved by a new-to-old (N2O)
version ordering, because invalidation timestamps of predecessors
can be reconstructed from the creation timestamps of previously
processed successors, so predecessors remain unchanged. Garbage
collection (GC) is required for space reclamation, but adds
complexity to data structures. Effects on indexes and WA should be
minimized.
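
The reconstruction of invalidation timestamps can be sketched in a few
lines. The sketch assumes that a chain is given as a list of creation
timestamps in N2O order; the function name validity_intervals is
hypothetical. Because the successor is always processed before its
predecessor, its creation timestamp can serve as the invalidation
timestamp of the predecessor without storing it.

```python
def validity_intervals(creation_ts_newest_first):
    """Return (creation_ts, invalidation_ts) pairs while walking new-to-old."""
    successor_ts = None          # the newest version has not been invalidated
    intervals = []
    for ts in creation_ts_newest_first:
        intervals.append((ts, successor_ts))
        successor_ts = ts        # becomes the invalidation of the next older version
    return intervals


intervals = validity_intervals([30, 20, 10])
```

The predecessor records themselves are never rewritten, which is what
keeps the base table append-only.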

An N2O version ordering requires index maintenance for every
new tuple version, because the entry point of the version chain
changes. A logical indirection layer could reduce the index maintenance
effort; however, a sequential write pattern to secondary storage
cannot be achieved with traditional index structures. Due to high
update rates on indexes, caused by the insertion of new tuple versions
on modifications, traditional index structures become a bottleneck.

We decided to implement Partitioned B-Trees (PBT) in PostgreSQL
extended with SIAS [3], due to its beneficial append-only
write I/O properties to secondary storage. We describe the structure
and algorithms of PBT and how it achieves the preferable
sequential write pattern to secondary storage.


4 APPROACH: PARTITIONED B-TREES 


Partitioned B-Trees (PBT) [5] are based on traditional B*-Trees [8]
and make use of their intrinsic and well-studied algorithms with few
modifications. The essential difference is the introduction of an
artificial leading key column, the partition number. An index record
consists of a partition number, its search key columns and a physical
tuple reference or a unique virtual identifier for tuple assignment.
Every distinct partition number value describes a single partition.
This enables the PBT to maintain partitions within one single
tree structure in alphanumeric sort order. Partitions can support
additional functionalities, like reorganizations or bulk loads [5].
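
The role of the leading key column can be illustrated with a sorted
list that stands in for the leaf level of a single tree. This is a
sketch under simplifying assumptions, not the PostgreSQL
implementation: the names index, insert and partition_range are
hypothetical, and Python's tuple comparison replaces the key comparison
of the B*-Tree.

```python
import bisect

index = []   # one sorted sequence of (partition_no, search_key, ref) records


def insert(partition_no, key, ref):
    """Insert a record at its regular position in the common sort order."""
    bisect.insort(index, (partition_no, key, ref))


def partition_range(partition_no):
    """Return the contiguous run of records that forms one partition."""
    lo = bisect.bisect_left(index, (partition_no,))
    hi = bisect.bisect_left(index, (partition_no + 1,))
    return index[lo:hi]


insert(0, "m", 1)
insert(1, "a", 2)
insert(0, "b", 3)
insert(1, "z", 4)
```

Although the keys arrive in arbitrary order, all records of partition 0
precede those of partition 1, and each partition is sorted by search
key within its own contiguous range.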


[Figure 2 labels: PBT-Index; DB Buffer; PBT-Buffer; Partition no. 0, 1, ..., n-1; sequential write of Partition n-1]


Figure 2: Sequential write of a Partition


PBTs write any modification of index records exactly once on
eviction of a partition, except for later reorganization or garbage
collection operations, which enables a beneficial sequential write
pattern to secondary storage. This is realized by evicting all related
leaf nodes of a partition. Leaf nodes of modifiable main memory
partitions are stored in a separate area of the buffer cache, the
PBT-Buffer. Records can only be inserted or updated in partitions
that are located in the PBT-Buffer. If the PBT-Buffer is full, a
main memory partition is written to secondary storage: First, a
new partition is created to support ongoing modifications and the
partition that has to be evicted becomes immutable. Second, a
bloom filter and a prefix bloom filter are created and filled with all
index records in the recently closed partition. Last, all leaf nodes are
sequentially written to secondary storage. PBT indexes in MVCC
are not lossy; however, they return a set of entry points
to candidate tuples, which have to be verified in a visibility check.
We describe the index operations in a PBT:


Insert Operations are only performed in a mutable main memory
partition. To this end, the partition number is prepended to the
first search key column. The index structure is traversed and the index
record is inserted at its regular position in the B*-Tree structure. The
leaf node is guaranteed to be located in main memory. Uniqueness
constraints are supported by first performing a read operation.


Algorithm 1 Partitioned B-Tree - INSERT


Input: Regular |attr_val|, ref
Output: ErrCode

procedure INSERT(|attr_val|, ref)
    Let part_insert ← MAX(PartitionsList)
    Let part_rec ← FORM_PART_REC(part_insert, |attr_val|, ref)
    UNIQUENESS_CONSTRAINT_CHECK(|attr_val|)    ▷ check all Partitions
    return DO_REGULAR_INSERT(part_rec)
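
Algorithm 1 and the eviction of a full partition can be sketched
together. The class PartitionedIndex is a toy stand-in and not the
PostgreSQL implementation: a sorted list replaces the tree,
buffer_capacity counts records instead of buffer pages, the uniqueness
check is a linear scan, and filter creation is omitted. All names are
assumptions made for illustration.

```python
import bisect


class PartitionedIndex:
    """One sorted list of (partition_no, key, ref) records."""

    def __init__(self, buffer_capacity):
        self.records = []
        self.mutable_partition = 0     # highest partition number, in memory
        self.buffer_capacity = buffer_capacity
        self.buffered = 0              # records in the mutable partition
        self.sequential_writes = []    # partitions written to secondary storage

    def contains(self, key):
        # The uniqueness check has to consider all partitions.
        return any(k == key for _, k, _ in self.records)

    def insert(self, key, ref, unique=False):
        if unique and self.contains(key):
            return "ERR_DUPLICATE"
        if self.buffered == self.buffer_capacity:
            self.evict()
        bisect.insort(self.records, (self.mutable_partition, key, ref))
        self.buffered += 1
        return "OK"

    def evict(self):
        # The mutable partition has the highest number, so its records
        # form the tail of the list and are written in key order.
        lo = bisect.bisect_left(self.records, (self.mutable_partition,))
        self.sequential_writes.append(self.records[lo:])
        self.mutable_partition += 1    # new partition absorbs further inserts
        self.buffered = 0


pbt = PartitionedIndex(buffer_capacity=2)
pbt.insert("b", 1)
pbt.insert("a", 2)
pbt.insert("c", 3)                     # buffer is full: partition 0 is evicted
status = pbt.insert("a", 9, unique=True)
```

After the third insert, partition 0 has been written once as a sorted
run, and the new record is placed in partition 1. The records of
partition 0 are never touched again.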


Update Operations can be performed in-place if the index record
is still in a mutable main memory partition. In this case, only the
physical tuple reference field has to be modified. If the index record
of the updated tuple is in an evicted immutable partition on secondary
storage or search key columns are affected, an insert into a
mutable main memory partition is performed.


Algorithm 2 Partitioned B-Tree - UPDATE


Input: Regular |attr_val,old|, |attr_val,new|, ref
Output: ErrCode

procedure UPDATE(|attr_val,old|, |attr_val,new|, ref)
    Let part_update ← MAX(PartitionsList)
    Let part_rec ← FORM_PART_REC(part_update, |attr_val,old|, reference)
    if |attr_val,old| = |attr_val,new| and FIND(part_rec) then
        return IN_PLACE_UPDATE(|attr_val,old|, ref)
    else
        return INSERT(|attr_val,new|, ref)
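
The case distinction of Algorithm 2 can be sketched with dictionaries.
The sketch assumes that partitions is a list of dictionaries mapping a
search key to a tuple reference and that the last dictionary is the
mutable main memory partition; the function name update and the return
values are hypothetical. How an outdated record in an older partition
is treated afterwards is left to the visibility check and is not
modeled here.

```python
def update(partitions, old_key, new_key, ref):
    """Update in-place if possible, otherwise insert into the mutable partition."""
    mutable = partitions[-1]
    if old_key == new_key and old_key in mutable:
        mutable[old_key] = ref     # only the tuple reference field changes
        return "in_place"
    mutable[new_key] = ref         # immutable partitions are left untouched
    return "insert"


partitions = [{"a": 1}, {"b": 2}]  # partition 0 is evicted, partition 1 is mutable
first = update(partitions, "b", "b", 20)   # record is still in main memory
second = update(partitions, "a", "a", 10)  # record is in the evicted partition
third = update(partitions, "b", "c", 30)   # search key column is affected
```

Only the first call modifies an existing record. The other two calls
add records to the mutable partition and leave partition 0 unchanged.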


Delete Operations are performed similarly to update operations.
The physical tuple reference points to a tombstone record.


Algorithm 3 Partitioned B-Tree - SCAN


Input: Regular |attr_val,min|, |attr_val,max|
Output: |refs|

procedure SCAN(|attr_val,min|, |attr_val,max|)
    for each part_scan ∈ PartitionsList do    ▷ start MAX(PartitionList)
        Let part_rec_min ← FORM_PART_REC(part_scan, |attr_val,min|)
        Let part_rec_max ← FORM_PART_REC(part_scan, |attr_val,max|)
        if |attr_val,min|-|attr_val,max| ∈ part_scan.filter then
            Let ref ← FIND_IN(part_rec_min, part_rec_max)
            |refs|.ADD(ref)
            loop
                if not HASNEXT() then
                    break
                Let ref ← NEXT()
                |refs|.ADD(ref)
    return |refs|


Search and Scan Operations are not allowed to be lossy; however,
they return a set of candidate tuples, which have to be verified in
a visibility check in the base table. The query search predicates are
modified to match the search key columns in a PBT: a partition
number is prepended to the first search key column. The partitions
in a PBT are traversed and scanned from the highest to the lowest
numbered partition. This behavior is beneficial for performing reads
on unique search key column values. If a matching index record
is found, lower numbered partitions do not have to be
processed and the algorithm can terminate early, whereby RA is
reduced. The returned candidate tuples are sent to the visibility
check in the base table. Order requirements for scans are processed
afterwards. Read and scan operations can be accelerated by filter
techniques. (Prefix) bloom filters reduce RA and latencies of point
and range queries.
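
A point look-up with early termination and partition skipping can be
sketched as follows. The BloomFilter class is a generic textbook bloom
filter built on SHA-256 and not the filter implementation of the
prototype; its size, the number of hash functions, and the function
name point_lookup are assumptions. For simplicity every partition has a
filter here, including the newest one.

```python
import hashlib


class BloomFilter:
    """Bit field with k hash positions per key; no false negatives."""

    def __init__(self, bits=256, hashes=3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for pos in self._positions(key):
            self.field |= 1 << pos

    def might_contain(self, key):
        return all((self.field >> pos) & 1 for pos in self._positions(key))


def point_lookup(partitions, filters, key):
    """Probe partitions from the highest to the lowest number."""
    probed = []
    for no in range(len(partitions) - 1, -1, -1):
        if not filters[no].might_contain(key):
            continue               # the filter proves absence: skip partition
        probed.append(no)
        if key in partitions[no]:
            return partitions[no][key], probed   # stop at the first match
    return None, probed


partitions = [{"a": 1, "b": 2}, {"c": 3}, {"a": 4}]
filters = []
for partition in partitions:
    bloom = BloomFilter()
    for key in partition:
        bloom.add(key)
    filters.append(bloom)

ref_a, probed_a = point_lookup(partitions, filters, "a")
ref_b, probed_b = point_lookup(partitions, filters, "b")
```

The look-up for "a" ends in the highest numbered partition and returns
the most recent index record. The look-up for "b" reaches partition 0,
and partitions whose filter rules out the key are skipped, although
false positives may still cause additional probes.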

Figure 3: OLTP Benchmark Throughput Evaluation


5 EXPERIMENTAL EVALUATION


We compare Partitioned B-Trees (PBT) to traditional
B*-Trees in PostgreSQL, an RDBMS with an MVCC transaction management
scheme. PostgreSQL uses an O2N version order and physical
tuple version storage. Index records have a physical reference to
items located in base tables; this configuration is denoted as
B*-Tree (PG9.04/HOT). PostgreSQL base table storage was modified to
SIAS with a beneficial append-only write pattern and N2O version
ordering. We evaluated B*-Trees and PBT with physical and logical
tuple references.

We deployed the DBMS on an Ubuntu 16.04 server with an Intel(R)
Xeon(R) 3.50GHz processor and an Intel SSD as secondary storage
device. We used the well-known DBT-2 [1] OLTP benchmark.

First, we evaluate the throughput of B*-Tree (PG9.04/HOT) as well
as SIAS with B*-Trees with physical and logical references in the
DBT-2 benchmark. In Figure 3a, we show the throughput for different
dataset sizes. The buffer cache of the DBMS is set to 600MB. The
dataset size increases with the warehouse count. B*-Tree (PG9.04/HOT)
performs well if most buffers are located in main memory. Updates
are performed in base tables by HOT. The index maintenance
effort is low due to this indirection. If the workload becomes
write-intensive, the throughput falls rapidly. SIAS has a scalable
throughput [3], but the increased effort in index management reduces
performance with physical reference B*-Tree updates. With an
indirection layer, index management is reduced to inserts and updates
of search key columns, whereby the throughput is increased
by up to 20% and SIAS performs better than PG9.04/HOT at 1200
warehouses. Effects of the indirection layer on index management are
minimal for PBT. The throughput difference is 6% at the dataset of
1000 warehouses. As the dataset grows, there is almost no difference
in throughput between PBT with physical and logical references.
The index is able to absorb additional modifications. PBT with SIAS
has a 50% higher throughput than comparable B*-Trees
with physical references and about 30% higher with an indirection
layer at 2000 warehouses. The append-only approaches (SIAS and PBT)
outperform PG9.04/HOT as the benchmark becomes write-intensive at
700 warehouses, with an improvement of up to 30% at 2000 warehouses.

Partitioned B-Trees (PBT) append modifications to the dataset in
a main memory partition. The effort of look-ups and especially of scans
increases with the number of partitions (see Figure 3b), because in
theory every partition has to be traversed. Up to 25 partitions were
created for update-intensive indexes over the test duration. Point
queries can stop the look-up at the first matching record that is
visible to a transaction snapshot. Point queries can also skip
partitions based on bloom filters, which increases throughput by up to
10%. The benchmark includes several scans. Prefix bloom filters include
a fixed set of scan attributes and increase total throughput by
another 10%.

We evaluated the write pattern of PBT (see Figure 4a). The diagram
shows the eviction of a main memory partition to secondary
storage. Each red cross indicates the write of an index node. Several
parallel and sequential writes of extents are shown in the diagram.
Once an index node has been evicted to secondary storage, its contents
never change. PBT achieves the desired beneficial write pattern.

In Figure 4b we show the requests on index nodes (blue) and base
table nodes (red) for a write-heavy OLTP benchmark. Requests on
cached nodes are displayed brighter than fetches from secondary
storage. The results are calculated for equal throughput over the test
duration and all tables and indexes. PBT requires more requests on
index nodes due to the partitioning of index records and larger record
sizes. Most requests are on buffered nodes, because many queries
can be answered in the main memory partition. Index records of
recent tuple versions are commonly located there. Requests can
benefit from a better cache hit rate in comparison to B*-Trees.


6 CONCLUSION 


We presented different design decisions in the MVCC transaction
management scheme for large-scale datasets and update-intensive
OLTP workloads, regarding the characteristics of modern storage
technologies. We outlined resource-efficient append-only version
organization in base tables and its conflict with index management.
We implemented Partitioned B-Trees for the first time in a DBMS with an
MVCC transaction management scheme and evaluated their throughput
and characteristics. PBT achieves up to 50% higher throughput
than comparable B*-Trees.


ABSTRACT


Imperfect data express their meaning incompletely, and the theory
of fuzzy sets provides mathematical support for the interpretation
of such data. The union of these concepts describes a new data
type, called fuzzy data. We discuss the use of fuzzy data in graph
databases. Previous works define fuzzy queries on graph databases,
but the stored data are regular, perfect data. In this work we
extend a graph database and let users store
fuzzy and imperfect data. The database management system Neo4j
is proposed for developing the application and the Cypher database
language for describing the imperfect data definitions. We use a
social network use case to illustrate the work.


1 INTRODUCTION


More and more applications use data
captured in real time from different sources and in different formats.
These data do not always have a clear and complete meaning; they admit
subjective interpretations or need complements so that
they can be understood. Due to these characteristics, they
are termed imperfect data [5]. The theory of fuzzy sets arises as
a way to aid in the interpretation of these data, since it modifies the
traditional Boolean concept [15] [7]. Traditionally, an element may
or may not belong to a set, represented mathematically by 0 or
1. In the theory of fuzzy sets this concept is expanded, allowing one
to define how much an element belongs to a set. A value between 0
and 1 is assigned to the element in order to represent the relevance
this element has with respect to the related set. In this way, it
becomes possible to mathematically associate imperfect data with a set
of possible interpretations. The data generated by the union of these
concepts are referred to as fuzzy data.

The patterns of traditional database structures do not fully support
the storage and manipulation of fuzzy data [9]. This is
due to the complexity of relating fuzzy data to crisp
data. Various approaches have been developed in order to offer
adequate support for this type of data in the most varied database
structures. However, the methods developed do not completely meet the
requirements, or become complex depending on the type of
the data and the structure that will carry it.

Graph databases [1] have gained prominence among the various
existing data structures, largely because of their flexibility,
which makes it possible to represent complex relationships between
different types of data. A management system for graph databases,
Neo4j, facilitates the visualization of the data structure
and its relationships. This system is considered the favorite among
developers; it offers quick and efficient support for the handling of
data, with its own easy-to-understand language, Cypher. Given
this, we examine the possibility of incorporating fuzzy data into this
model, since unstructured data models allow the interaction
between different types of data.

The purpose of this article is to present a method that incorporates
fuzzy data into a database, allowing the insertion of different
types of data. Due to the complexity of relating different types of
data in traditional database models, we consider the use of graph
databases here. We chose the graph model because of its advantages
in the manipulation of complex data when compared to other database
models, and because it is little explored in the literature. Thus,
this article presents an
application developed for the use of fuzzy data in graph databases,
both in the data structure and in the query and insertion instructions.
It seeks to complement the Neo4j system, allowing imperfect data
to be used, based on the theory of fuzzy sets. The remainder of
this article is organized as follows. Section 2 presents the definition
of concepts belonging to fuzzy sets and imperfect information.
Some important aspects of graph databases as well as the Neo4j
management system are presented in Section 3. Section 4
presents the problem in more detail and the way the
proposed application solves it. Section 5
presents the results obtained in the application of a use case, as well
as expectations for the further improvement of the application.


2 IMPERFECT INFORMATION AND FUZZY SETS


Imperfect information is related to incomplete, inaccurate or vague
data. Understanding stored data can be complex because
it is difficult to establish the relation between data of different
types, even if they refer to the same set of information. Imperfections
in information are classified into five main types: imprecision,
uncertainty, vagueness, ambiguity, and inconsistency [6]. The
occurrence of one type of imperfection does not exclude the possibility
of a second or third type related to the first. An example is “I’m
95% certain that John was born in the 90’s”, which demonstrates the
occurrence of uncertainty along with imprecision in the
same information.


Imperfect data should not be interpreted solely as true or false
Boolean values. The definition or interpretation of imperfect data
depends on the subjectivity of the user, based on aspects
experienced by the user. Factors such as place of birth and age
are examples of aspects that affect the interpretation of data. For
example, the classification of a 52-year-old person is relative,
since, from the point of view of a child, most adults are classified
as elderly. A second case is the rating of a person's height. In a
region where the average height of the
population reaches at most 170cm, a person of this height is
considered tall. However, in another region with a higher average
height, the same height may be considered only average.

The theory of fuzzy sets assists in the definition of imperfect data.
Rather than forcing the assignment of an exact value to imperfect
data, we mathematically rank its possible interpretations by their
relevance. Fuzzy sets, initially proposed in [15], are closely related
to the concept of imperfect information. They extend the traditional
Boolean view, in which an element either belongs to a set or does
not. In fuzzy set theory, a membership coefficient can be assigned to
an element, representing how much that element belongs to the set.
Let $U$ be a discrete or continuous universe of elements $u$. A fuzzy
set $F$ in $U$ is characterized by a membership function $\mu_F(u)$,
which associates each element of $U$ with a value in the range
$[0, 1]$. The set $F$ can be expressed as a set of ordered pairs, that
is, $F = \{(u, \mu_F(u)) \mid u \in U\}$. The support set of $F$ is the
subset of $U$ with degree of membership greater than 0, such that
$\mathrm{Supp}(F) = \{u \mid u \in U, \mu_F(u) > 0\}$. An inflection
point of $F$ is an element whose membership value is $\mu_F(u) = 0.5$,
considered the point of greatest uncertainty of the set. The
$\alpha$-cut of $F$ is a threshold level: elements are considered valid
provided that their degree of membership is equal to or above that
value, that is, $F_\alpha = \{u \mid \mu_F(u) \ge \alpha\}$ for
$0 < \alpha \le 1$. Elements selected by an $\alpha$-cut with a value
close to 1 are those considered to belong most strongly to the set.
In general, several functions can be used to obtain a membership
degree. Among the most common are the trapezoidal and triangular
functions [15]. The trapezoidal function is expressed by the
quadruple $(A, B, C, D)$, where $C(F) = [B, C]$ and
$S(F) = [B - A, C + D]$.
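
The definitions above can be illustrated on a small discrete fuzzy set. The following is a minimal sketch: the ages and membership degrees are invented for illustration, and the function names are hypothetical rather than part of the application described in this paper.

```python
# A discrete fuzzy set F over a universe U, stored as {element: membership degree}.
# The ages and degrees below are illustrative values, not taken from the paper.
F = {20: 0.0, 30: 0.2, 40: 0.5, 50: 0.8, 60: 1.0}


def support(fuzzy_set):
    """Supp(F): elements with membership strictly greater than 0."""
    return {u for u, mu in fuzzy_set.items() if mu > 0}


def alpha_cut(fuzzy_set, alpha):
    """F_alpha: elements whose membership is at least alpha, with 0 < alpha <= 1."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    return {u for u, mu in fuzzy_set.items() if mu >= alpha}


def inflection_points(fuzzy_set):
    """Elements with membership exactly 0.5 (point of greatest uncertainty)."""
    return {u for u, mu in fuzzy_set.items() if mu == 0.5}
```

With these values, the support excludes only the element of degree 0, and raising alpha towards 1 keeps only the elements that belong most strongly to the set.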

For example, a trapezoidal function can classify a 52-year-old as an
adult or as an elderly person. This age has a membership degree of
0.2 for the term elderly and 0.8 for the term adult, which makes the
classification more compatible with the way users interpret the data.
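
As a rough illustration of this classification, the sketch below evaluates a trapezoidal membership function for the age 52. It assumes a parameterisation by four absolute breakpoints a <= b <= c <= d (as in the DEFINE statements used later in the paper) rather than the quadruple with spreads given above. The breakpoints for "adult" and "elderly" are hypothetical, chosen only so that the quoted degrees 0.8 and 0.2 are reproduced.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 1 on [b, c], 0 outside (a, d), linear on the slopes."""
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# Hypothetical breakpoints, not taken from the paper.
ADULT = (18, 25, 50, 60)
ELDERLY = (50, 60, 120, 130)

adult_degree = trapezoid(52, *ADULT)      # falling slope: (60 - 52) / (60 - 50)
elderly_degree = trapezoid(52, *ELDERLY)  # rising slope: (52 - 50) / (60 - 50)
```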
Combining fuzzy logic with the concepts of imperfect data has given
rise to so-called fuzzy data. Because fuzzy data are defined through
fuzzy logic, their interpretations can be determined mathematically.


3 GRAPH DATABASES 


The representation of information through graphs has been used
by several models. A graph G is composed of vertices V and edges
E, formally expressed by G = (V, E). Basic concepts about the
various models of graph databases, their aspects and comparisons
can be found in [13], [4], [8], [1] and [2]. The property graph has
been adopted by most developers as the standard model for graph
databases. In such graphs, attributes are defined as properties of
vertices and edges, which allows a more complete representation of
the relations among the data. Graph databases have been used for
semi-structured models such as XML [2], and interest in them has
grown because they can support complex data structures.

The Neo4j management system is considered the most popular
among developers of graph databases [4]. This is due to the support
it offers for different programming languages, and to the way it
integrates the concepts of property graph databases. Neo4j also
includes its own manipulation language, Cypher. The main
philosophy of Cypher is to offer a simple and intuitive syntax that
is easy to read and understand [12]. This has increased interest in
the language because, unlike with other traditional models,
development does not require deep familiarity with the language.


3.1 Fuzzy Graph Databases 


A traditional graph database is composed of exact data, represented
by vertices or their properties, and of the relationships that exist
between these data, represented by edges. Integration with fuzzy
logic led to the emergence of fuzzy graph databases. This model
is especially suited to data whose relationships are not fully
defined. An overview of concepts about fuzzy graphs is presented
in [14].

In fuzzy graphs, fuzzy information can occur both in the data
held in the attributes of the vertices and in the structure of the
relationships of a graph database [11], [10]. Fuzzy data must
therefore be accepted as property values of a vertex, and must also
be usable to define a relationship. As a consequence, the Boolean
view of whether or not a particular relationship exists has to be
reinterpreted. A path or a distance between vertices can then be
analyzed differently, under the influence of these fuzzy data.

In Figure 1, from [11], fuzzy data occur. The relationships between
the vertices of the graph carry values that modify how they are
understood. The relationship of the vertex "Pierre" with the vertex
"Serge" is defined by an edge labeled "contributor" with value 0.04.
Considering the other edges labeled "contributor" and their values,
the highest value assigned is 1. Any value below 1 is then considered
to be a relationship of lesser intensity.
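
One way to see how such edge values change path analysis is sketched below. It rests on an assumption that the paper does not state: the strength of a path is taken as the weakest membership degree along it (the min operator, a common convention for fuzzy graphs). Only the Pierre to Serge value of 0.04 comes from the text; the other vertex and degrees are hypothetical.

```python
# Edges carry a membership degree in [0, 1]. Only the Pierre -> Serge value (0.04)
# appears in the text; the vertex "Anne" and the other degrees are hypothetical.
edges = {
    ("Pierre", "Serge"): 0.04,
    ("Pierre", "Anne"): 1.0,
    ("Anne", "Serge"): 0.6,
}


def path_strength(path, edges):
    """Strength of a path as its weakest edge (min operator, an assumption)."""
    return min(edges[(u, v)] for u, v in zip(path, path[1:]))


direct = path_strength(["Pierre", "Serge"], edges)
via_anne = path_strength(["Pierre", "Anne", "Serge"], edges)
```

Under this convention the two-edge path is stronger than the direct edge, which a Boolean view of the relationships could not express.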


2 https://neo4j.com 
Figure 1: Representation model of the fuzzy graph.


A fuzzy graph model must also be able to support both perfect and
imperfect data types. The model of [11] does so, offering equivalent
support for imperfect and perfect information. Figure 1
demonstrates these concepts through its edges.

Due to the nature of the graph database model, the representation
of complex information is simplified. The greatest difficulty lies in
relating a fuzzy datum to a precise, or at least approximate,
interpretation. Therefore, a large part of the studies that aim to
incorporate fuzzy logic into graph databases are directed to the
execution of query instructions. These queries should assist users in
obtaining information regardless of the form in which it is
stored.


4 A FUZZY DATA DEFINITION


Based on the proposals of [10] and [3], this work presents an
application developed to integrate fuzzy data. Unlike the
above-mentioned works, the application is not limited to the
execution of fuzzy queries, which seek to obtain data through a
flexible syntax.

In this model, we aim to incorporate new functionality, such
as the insertion of fuzzy data and the execution of both flexible
and traditional queries over perfect or fuzzy data. We used
RabbitHole, which is based on the Neo4j system, because the
system had to be modified internally to accept the new types of
data. The application was written in Java, in conjunction with
Apache Maven.

The proposed environment allows new functionality to be modified
and tested without compromising the functions of the official
system. We developed an algorithm that interprets query and insert
instructions containing fuzzy data. To analyze a fuzzy datum, the
system must first be "taught" some fundamental parameters. A second
algorithm was therefore developed, responsible for storing the
definitions declared by the users about the fuzzy


1 https://github.com/neo4j-contrib/rabbithole 
3 https://www.java.com 
3 https://maven.apache.org 
data. This stored document enables the interpretation algorithm to
be executed. Finally, an algorithm was needed to measure and
classify the relation between the data declared in an instruction and
the data obtained in response. The classification algorithm evaluates
the membership degree measured in this comparison. The
classification is used only with query instructions, not with
insertions.

In general, to enable the insertion or querying of fuzzy data, or even
the classification of the answers obtained, we must inform the
system of the parameters it needs to interpret these data as the user
understands them. Unlike in the model of [10], defining fuzzy data
at the time an instruction is executed would not be feasible, since
every possible data type existing in the model would have to be
defined each time. The fuzzy data definitions must therefore be
stored so that they can be used later. The structure generated for
this purpose is the Fuzzy Data Definition Document (FDDD). This
structure, kept separate from the system, allows the modified
algorithms to look up the fuzzy data defined by the users. In
addition, it enables definitions to be shared with other systems or
users. In this proposal, an XML document is used to store the fuzzy
data definitions, because XML can be read by many systems and
because the Dom4J Java library provides several resources for
manipulating this type of document. This made integration with
the system easier.

To describe fuzzy data, we extended the Cypher language so that
it understands fuzzy data definitions. In this way, a user can define
these data directly in the application and at run time. Regular
expressions are used to read the new syntax, which makes this
functionality possible. An example of the syntax of a fuzzy data
definition is presented in the following statement.


DEFINE NODE PROPERTY linguistic_variable : age AS young =
{10, 20, 30, 40}
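
To give an idea of how a regular expression can read such a statement, here is a minimal sketch. The pattern and the function name are hypothetical: the grammar actually used by the application is not given in the paper, and only the linguistic-variable form of DEFINE is covered.

```python
import re

# Hypothetical pattern for the linguistic-variable form of DEFINE; the grammar
# actually used by the application is not given in the paper.
DEFINE_RE = re.compile(
    r"DEFINE\s+(?P<position>NODE|EDGE)\s+PROPERTY\s+(?P<kind>\w+)\s*:\s*"
    r"(?P<attribute>\w+)\s+AS\s+(?P<term>\w+)\s*=\s*\{(?P<params>[^}]*)\}",
    re.IGNORECASE,
)


def parse_define(statement):
    """Return the parts of a DEFINE statement as a dict, or None if it does not match."""
    match = DEFINE_RE.fullmatch(statement.strip())
    if match is None:
        return None
    parsed = match.groupdict()
    # The four numbers are kept as floats, in the order in which they were declared.
    parsed["params"] = [float(p) for p in parsed["params"].split(",")]
    return parsed


parsed = parse_define(
    "DEFINE NODE PROPERTY linguistic_variable : age AS young = {10, 20, 30, 40}"
)
```

The resulting dictionary holds the graph position, the type of imperfect data, the attribute name, the linguistic term and its parameters, which are the items that the definition document has to store.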

Each type of imperfection is interpreted differently. Therefore, it
is not possible, so far, to define a generic type of imperfect data,
and an interpretation syntax has been developed for each data
type. Figure 2 relates the data types to their definition syntax, as
interpreted by the system. This relation is stored in the FDDD.

The proposed classifications allow the application to identify
which group of possible interpretations a datum belongs to. A
relation can then be established between this datum and others in
the same group, on the grounds that they have a certain affinity.

Each group uses a specific type of function, such as trapezoidal or
triangular, to obtain the membership degree of the data associated
with it.

The interpretation is based on the classification of the type with
which the fuzzy datum is associated, its position in the graph, and
the base parameters given in the definition of the data type.

In this way, the application can interpret and associate different
types of data: perfect data, imperfect data, and imperfect data with
different types of imperfection.

The proposed classification model is intended to allow different
types of imperfect data to be used, since these can be defined by
the users themselves.


https://dom4j.github.io/
DEFINE EDGE PROPERTY null_values : knows AS dne, unk, ni, open
    Defines which null values are accepted for the attribute.

DEFINE NODE PROPERTY interval : age AS 1
    Defines the range of elements which comprises a set.

DEFINE EDGE PROPERTY set : knows AS {value A, value B, value C}
    Defines a set of values that can be related.

DEFINE NODE PROPERTY linguistic_variable : age AS young = {10,20,30,40}
    Defines the interpretations for the terms related to linguistic
    variables.

DEFINE EDGE fuzzy_expression : friend AS CLOSE = ( a/b ) TO {a,b}
    Defines the method of interpretation of fuzzy expressions.


Figure 2: Defining syntax for imperfect data types.


This expands the range of data that can be used in a database
system, because the limit is determined only by how much the
application can interpret from these data.

The classifications that refer to the group of the imperfect data
type were designed to help the application read and interpret the
data used.

This classification is essential to prevent the application from
obtaining duplicate interpretations, because one imperfect datum
may resemble another yet have a different interpretation,
depending on the context in which it is used.

Figure 3 presents the algorithm that defines the different fuzzy
data types, as a step-by-step procedure. The algorithm first checks
whether the element definition already exists. This avoids duplicate
definitions and thus prevents different interpretations of the same
imperfect data type. Otherwise, the real interpretation could not be
determined exactly: the application would always use the first
definition and ignore the others.
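
The definition step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses Python's standard `xml.etree.ElementTree` module as a stand-in for the Dom4J library, the function names are hypothetical, and only linguistic terms are handled.

```python
import xml.etree.ElementTree as ET


def get_or_create(parent, tag):
    """Return the child element named `tag`, creating it only if it is missing."""
    child = parent.find(tag)
    if child is None:
        child = ET.SubElement(parent, tag)
    return child


def define_imperfect_data(root, position, attribute, imperfect_type, term, params):
    """Store one linguistic-term definition; an existing term is left untouched."""
    position_el = get_or_create(root, position)            # e.g. node_property
    attribute_el = get_or_create(position_el, attribute)   # e.g. age
    type_el = get_or_create(attribute_el, imperfect_type)  # e.g. linguistic_variable
    if type_el.find(term) is None:                          # avoid duplicate definitions
        ET.SubElement(type_el, term, {k: str(v) for k, v in params.items()})
    return root


root = ET.Element("neo4j")
define_imperfect_data(root, "node_property", "age", "linguistic_variable", "young",
                      {"type": "TRAPEZOIDAL", "a": 10, "b": 20, "c": 30, "d": 40})
# A second definition of the same term is ignored rather than duplicated.
define_imperfect_data(root, "node_property", "age", "linguistic_variable", "young",
                      {"type": "TRAPEZOIDAL", "a": 0, "b": 5, "c": 10, "d": 15})
```

Checking for an existing element at each level keeps a single definition per term, so a later lookup never has to choose between competing interpretations.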

Finally, a data structure with the definitions declared by the
users is generated, as shown in Figure 4. It groups the positions in
the graph, the identification of the attribute, and its definitions for
each type of declared fuzzy data.

Once fuzzy data are defined and stored, querying and inserting
them becomes possible, because the system knows how to interpret
a fuzzy datum when one occurs. However, defining a datum is not
sufficient for the system to insert it satisfactorily. A fuzzy datum
looks like an ordinary datum and differs only in not having the
same form as the other data of its set. We therefore propose
key-symbols that indicate to the system that a value is a fuzzy
datum. For example, when the value "young" is inserted as a
property of a vertex in a graph database, it can be understood as an
exact datum that describes a characteristic of a person. However,
this value may be standing in for an exact value unknown at the
time, namely the age of the person. In fuzzy logic, "young" is a
linguistic term that belongs to a linguistic variable. The key-symbols
remove this ambiguity and can be used for each of the fuzzy data
types: null values, sets, intervals, linguistic variables and fuzzy
expressions. Figure 5 shows the relationship between the
key-symbols, the proposed statement syntax, and the related data
type.

In addition to assigning the key-symbols to the fuzzy data, the 
Cypher language has been modified to simplify the use of these 


Algorithm 1 Fuzzy Data Definition (FDDD.xml)
Input: Cypher instruction
Output: Element created in the file FDDD.xml

Variables:
    x: instruction initiated by the term 'DEFINE'

function DEFINITION_IMPERFECT_DATA(x)
    Check type of statement x        (identifies the imperfect data type)
    if there is no element of "graph position" then
        create element for the "graph position"    (e.g. property of the vertex)
    end if
    if there is no element of "attribute name" then
        create element "attribute name"            (e.g. age)
    end if
    if there is no element of "imperfect type" then
        create element "imperfect type"            (e.g. linguistic variable)
    end if
    Write definitions of imperfect data
end function


Figure 3: Algorithm for defining the imperfect data in the
structure of the Fuzzy Data Definition Document.


symbols. Thus, a user need not know the symbol of the data type
and only has to use the syntax of the corresponding instruction.
CREATE (n:Person {name:"Madisara", age IS young})
CREATE (n:Person {name:"Madisara", age:">>>young"})

The conversion between the syntax used and the fuzzy data type
is performed internally and automatically by the system. This allows
the data to be recognized by the system as well as visually by the
users.
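
A minimal sketch of such a conversion is given below. The rewrite rule and both function names are hypothetical, and the sketch covers only the linguistic-variable case, in which `property IS term` is rewritten into a string value carrying the `>>>` key-symbol.

```python
import re

# Hypothetical rewrite rule: `prop IS term` inside a CREATE becomes `prop:">>>term"`.
IS_TERM = re.compile(r"(\w+)\s+IS\s+([A-Za-z_]\w*)")


def to_key_symbol(create_statement):
    """Rewrite the user-facing `IS term` form into the stored key-symbol form."""
    return IS_TERM.sub(r'\1:">>>\2"', create_statement)


def is_fuzzy_value(value):
    """A stored string is treated as a linguistic term if it starts with >>>."""
    return isinstance(value, str) and value.startswith(">>>")


stored = to_key_symbol('CREATE (n:Person {name:"Madisara", age IS young})')
```

The second function shows how the key-symbol lets the system tell a linguistic term from an ordinary string or number when the value is read back.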

Query instructions must accept the existing exact data as well as
the fuzzy data entered by the previous statement; the FDDD is what
makes such queries possible. Unlike in the insert statement,
key-symbols are not assigned to the fuzzy data. The parameters
guide the system on how to interpret the data, both in the
instruction and in the result. This makes it possible to execute the
following sample instructions.


MATCH (n:Person) WHERE n.age IS 18 RETURN n
MATCH (n:Person) WHERE n.age IS young RETURN n
MATCH (n:Person) WHERE n.age IS young WITH THRESHOLD 0.7 RETURN n
MATCH (n:Person)-[r:FRIEND #GOOD_FRIENDS(age, gender)]->(m:Person) WHERE
n.age IS young WITH THRESHOLD 0.7 RETURN n
<?xml version="1.0" encoding="UTF-8"?>
<neo4j>
  <node_property>
    <age type="INTEGER">
      <null_values>
        <dne />
        <unk />
        <ni />
        <open />
      </null_values>
      <linguistic_variable>
        <young type="TRAPEZOIDAL" a="10" b="20" c="30" d="40"/>
        <child type="TRAPEZOIDAL" a="10" b="20" c="30" d="40"/>
        <elderly type="TRAPEZOIDAL" a="10" b="20" c="30" d="40"/>
        <adult type="TRAPEZOIDAL" a="10" b="20" c="30" d="40"/>
      </linguistic_variable>
      <interval>1</interval>
    </age>
  </node_property>
  <edge_property>
    <fuzzy_expression>
      <good_friend scope="a,b">
        var r; if(a+b>0.8){r=true}else{r=false}
      </good_friend>
    </fuzzy_expression>
  </edge_property>
</neo4j>


Figure 4: Example of a Fuzzy Data Definition Document 
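
To show how such a document could be consulted, the sketch below reads the "young" definition from a reduced copy of the structure in Figure 4 and evaluates a few ages. It assumes that a, b, c and d are absolute breakpoints of the trapezoid. The paper does not state this, although the assumption agrees with the degrees reported later in Figure 6 (0.6 for age 34 and 0.5 for age 35). The function names are hypothetical and Python's `xml.etree.ElementTree` stands in for Dom4J.

```python
import xml.etree.ElementTree as ET

# Reduced definition document holding only the "young" term (values from Figure 4).
FDDD = """
<neo4j>
  <node_property>
    <age type="INTEGER">
      <linguistic_variable>
        <young type="TRAPEZOIDAL" a="10" b="20" c="30" d="40"/>
      </linguistic_variable>
    </age>
  </node_property>
</neo4j>
"""


def load_term(document, attribute, term):
    """Return the (a, b, c, d) breakpoints stored for a linguistic term."""
    root = ET.fromstring(document)
    element = root.find(f"./node_property/{attribute}/linguistic_variable/{term}")
    if element is None:
        raise KeyError(f"no definition for {attribute}.{term}")
    return tuple(float(element.get(k)) for k in "abcd")


def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 1 on [b, c], 0 outside (a, d), linear on the slopes."""
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


young = load_term(FDDD, "age", "young")
degrees = {age: trapezoid(age, *young) for age in (29, 34, 35, 45)}
```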


The first instruction uses the crisp operand 18; however, the
operator has been modified to indicate to the system that both crisp
and fuzzy data are accepted as answers. The second example uses a
linguistic term, in this case "young", which refers to a set of age
values of a person. The third example applies an acceptance
threshold: vertices whose membership degree is below 0.7 are not
accepted and are discarded from the final result. The last example
combines the previous concepts with an analysis of the
relationships between vertices. In this case, the query returns the
vertices of type "Person" that have at least one relationship with
another vertex. The fuzzy expression "Good Friends" evaluates the
properties "age" and "gender" of the vertices to define the degree
of intensity of their relation. The degree of intensity defined by a
fuzzy expression modifies the result of the relation between the
vertices, so that we know these vertices are related with a certain
intensity.


Type of Imperfect Data   Usage Instruction               Key-symbols of Representation

Linguistic Variable      IS identification               >>>identification
Fuzzy Expression         #FUNCTION_NAME({parameters})    >>:function_name({parameters})
Sets                     IN_SET{x,y,z}                   (illegible in source)
Intervals                IN_INTERVAL[x-y]                >>[x-y]
Null Values              DNE(), UNK(), OPEN(), NI()      >>*dne()


Figure 5: Key-symbols used with the relation of the imper- 
fect data type. 


Finally, the application adds to the results the degree of
compatibility between the instruction and the data obtained, in
addition to a final general coefficient. The fuzzy or crisp data used
in the instruction are compared with all the data obtained. The
coefficients lie in the range 0 to 1; the closer a coefficient is to 1,
the more compatible the result is with the declared statement. In
addition to informing the user about the compatibility between the
instruction and the results, the coefficients also serve to classify
these results. This "degree of satisfaction" makes it possible to rank
the results from the most to the least compatible.

Figure 6 presents an example of these concepts.


Query:
MATCH (n:Person) WHERE n.age IS young WITH THRESHOLD 0.2 RETURN n

Result                                         Interpreter (Satisfaction)
(0:Person {age:30, name:"João"})               1
(1:Person {age:">>>young", name:"Maria"})      (illegible in source)
(4:Person {age:29, name:"Madi"})               1
(3:Person {age:34, name:"Bruno"})              0.6
(2:Person {age:">>>adult", name:"Pedro"})      (illegible in source)
(10:Person {age:35, name:"Vanessa"})           0.5

Query took 27 ms and returned 6 rows.


Figure 6: Result obtained by the imperfect query instruction 
and its degree of satisfaction. 


The proposed application incorporates the main concepts of fuzzy
logic, such as trapezoidal functions, similarity functions and fuzzy
relations. Each function is used to obtain the membership degree
that best represents a fuzzy datum. For example, we assign the
value 1 when crisp data are used or obtained, since they do not
change. When a crisp value is compared with a fuzzy value, the
membership degree is obtained with the trapezoidal function,
when applicable. When two fuzzy values are compared, the
membership degree is obtained with Jaccard's similarity function.
In this way, much of the fuzzy data can be measured by applying
definitions based on the interpretation of the user.
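
The classification just described can be sketched as follows, under several assumptions. The fuzzy Jaccard index is computed here as the sum of minimum memberships over the sum of maximum memberships on a discretised age axis, which is one common formulation and not necessarily the one used by the application. The definition of "adult", the age range 0 to 120, the row for "Ana" and all function names are hypothetical; "young" follows Figure 4.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 1 on [b, c], 0 outside (a, d), linear on the slopes."""
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# "young" follows Figure 4; "adult" is a hypothetical definition for illustration.
TERMS = {"young": (10, 20, 30, 40), "adult": (25, 35, 55, 65)}
UNIVERSE = range(0, 121)  # assumed discretisation of ages for the Jaccard sums


def fuzzy_jaccard(term_a, term_b):
    """Fuzzy Jaccard index: sum of min memberships over sum of max memberships."""
    pairs = [(trapezoid(x, *TERMS[term_a]), trapezoid(x, *TERMS[term_b]))
             for x in UNIVERSE]
    union = sum(max(p) for p in pairs)
    return sum(min(p) for p in pairs) / union if union else 0.0


def satisfaction(query_term, stored):
    """Degree to which a stored age satisfies `age IS query_term`."""
    if isinstance(stored, str) and stored.startswith(">>>"):  # fuzzy vs fuzzy
        return fuzzy_jaccard(query_term, stored[3:])
    return trapezoid(stored, *TERMS[query_term])              # crisp vs fuzzy


def run_query(rows, query_term, threshold):
    """Keep rows at or above the threshold, ranked from most to least compatible."""
    scored = [(name, satisfaction(query_term, age)) for name, age in rows]
    return sorted((r for r in scored if r[1] >= threshold), key=lambda r: -r[1])


rows = [("João", 30), ("Maria", ">>>young"), ("Bruno", 34),
        ("Vanessa", 35), ("Ana", 60)]
result = run_query(rows, "young", 0.2)
```

Rows below the threshold are dropped before ranking, which mirrors the role of WITH THRESHOLD in the extended queries.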


5 RESULTS 


The proposed application integrates into the Neo4j system the
concepts of fuzzy data, used both in the instructions and in the
analysis of the data stored in the graph structure.

It became possible to apply the flexible instructions presented in
[3] and to improve the model of [10]. Using a data structure that is
separate from the system enables data to be defined, changed, or
shared without compromising the integrity of the original system.
However, other types of data still need to be applied to continue
improving the model. To validate the proposed application, we use
the following use case. Figure 7 shows an example of a common
social network, represented as a graph database whose vertices
represent persons and whose edges represent the relationships
between them.

Representing the social relationships between people is simple in
general. However, a deeper analysis often requires data that
traditional systems do not fully understand. Based on the proposed
social network, we create some situations that validate the
fundamental concepts addressed in the proposed application.

• We insert a new Person in the network whose exact age we do
not know, only that the person looks young:

CREATE (p:Person {name:"Madisara", age:">>>young"})
MATCH (p:Person) RETURN p


Figure 7: Structure of a social network in the graph database model


• Which young people exist in this network?

MATCH (p:Person) WHERE p.age IS young RETURN p

• Increasing the certainty of the previous result, we have:

MATCH (p:Person) WHERE p.age IS young WITH THRESHOLD 0.7
RETURN p

• Which young people know elderly people?

MATCH (p1:Person)-[]-(p2:Person) WHERE p1.age IS young AND p2.age IS
elderly RETURN p1, p2

• Which Persons have a "good friend"? We assume that people of
close age and the same gender are more likely to have a stronger
relationship:

MATCH (p1:Person)-[r:FRIEND #GOOD_FRIENDS(age, gender)
]->(p2:Person) RETURN p1, r, p2

All the example instructions presented were made possible by the
application proposed in this work and did not exist previously,
either in the traditional model or in the works found in the
literature. This information could not be stored in a traditional
database model, nor could it be interpreted by its applications. For
the most part, it would be discarded as invalid because of its
imperfect nature.

We expect that the concepts demonstrated in the development of
this application can be deployed and integrated as one of the
functionalities of the Neo4j system. However, further validation is
needed with different types of fuzzy data in different situations. In
addition, we propose to develop an algorithm that acts generically
across the different types of data. The user interface should also be
improved to give greater support in the definition and use of fuzzy
data. Finally, artificial intelligence concepts could be applied:
by integrating user preferences, the definitions of fuzzy data could
be produced automatically.


CONCLUSION 


This paper presents a proposal for the integration of imperfect data
and fuzzy logic using the graph database management system Neo4j.
The choice of this system was motivated by the use of RabbitHole
and of previous tools for fuzzy data. In this work we introduced new
fuzzy data types for graph databases, as well as queries based on
vertices and on directed edge paths. With the application developed,
a graph can become even more flexible, allowing new types of fuzzy
data at all levels of the graph. A use case based on a social network
was implemented and validates the project. We hope that this
project can continue and be improved with more types of imperfect
data. It can also become a useful tool, contributing to improve the
relationship between users and various systems. So that the project
can continue, we make it available in its current form on GitHub,
where it can be accessed through the link:
ABSTRACT


In recent decades, numerous works have concentrated on climate
disruption. While several studies have analysed the issue on a
worldwide scale, few have focused on small territories. This is the
case, for instance, of the small islands of the Caribbean Sea, for
which very few works have sought to understand the effect of
climate disruption at the local scale. In this work, we focus on the
French island of Guadeloupe in the French West Indies, and we
conduct a study that has two goals. First, we analyse climate data
from the past 50 years to identify indicators of climate disruption.
Then, we show the effects of these disruptions on the agricultural
sector, specifically on the cultivation of bananas, which are widely
grown on the island. This approach, guided by field data, gives a
better understanding of the challenges posed by climate change in
this area and of their effects on crops that are sensitive to this kind
of disruption.


CCS CONCEPTS 


• Information systems → Data management systems; • Theory of
computation → Data structures design and analysis; • Applied
computing → Agriculture;
1 INTRODUCTION


In recent decades, much work has been done on climate change.
While some studies have addressed climate change on a worldwide
scale [5, 10], few have focused on small territories. This is the case,
for instance, of the small islands of the Caribbean Sea, for which
only a modest number of studies analysing the effect of climate
change can be found in the literature [8]. However, the impacts of
climate change are being felt in these territories [2, 9]. Thus,
understanding environmental changes and assessing their potential
effects on crops and harvests is essential for the resilience of the
population and the adaptation of agricultural practices [11].

In this work, we focus on the French island of Guadeloupe, located
in the French West Indies. Unlike the large majority of works, which
rely on climate projection models [7], we follow a data-driven
approach that has two goals.

(i) First, we analyse data over the past 50 years to identify evidence
of climate change. Our first objective is to highlight, using several
indicators, the climate disruption that has occurred over the past
50 years.

(ii) Then, we seek to understand the impacts of the observed
disruptions on the agricultural sector, particularly on banana
cultivation.

To the best of our knowledge, this is the first data analysis
approach that highlights the global and seasonal climate trends that
have occurred on the island over the past 50 years. This approach,
guided by field data, provides a better understanding of climate
tendencies in this territory and of their impacts on sensitive crops
such as bananas. It also raises the question of how climate change
may evolve and impact banana fields in the future.


2 RELATED WORKS 


For decades, the scientific community has been interested in the
issue of climate change, and the multiplication of sensors and their
evolution has led to the need to handle large amounts of data. This
is why big data processing methods are generally used. Nevertheless,
the use of big data alone does not guarantee a relevant climate
analysis, because of the many particularities of this paradigm.
Traditional methods must be adapted to issues specific to climate
data, such as time [6]. Beyond big data itself, data processing and
classification methods such as artificial neural networks or
clustering [12, 13] must also evolve to take the time aspect into
account, so that they can capture complex relationships, discover
spatial structure and integrate predictive modelling. In addition,
these works show that several major ocean climate indices are
closely linked to the Earth's climate.

Studies have already been carried out to highlight climate change.
Some of them examine it at a broad level, such as continents or the
globe [5]. That paper presents the state of knowledge on observed
precipitation and attempts to discern general patterns at the main
regional and continental levels, which led to the observation of
increased variance in precipitation around the globe.
Figure 1: Location of sensors on the island


On the other hand, some studies are limited to the country level
[1], but can yield indicators that are relevant for climate change and
applicable to other regions of the world, such as the amplitude of
daytime temperatures or the monthly average temperature.

Nevertheless, there are still geographical areas where studies on
climate change lag behind, either because of their complexity or
because of a lack of general interest; the Caribbean is one of them.
As a result, few studies deal with climate indicators and their
evolution in the Caribbean basin [2]. The studies that have been
conducted on the Caribbean basin nonetheless reach several
conclusions. First, they point out the particularity of the land-sea
interface, very present in the Caribbean zone, and the significant
influence of major oceanic phenomena (i.e. ENSO, NAO) on the
terrestrial climate [9]. Second, they demonstrate a summer drought
phenomenon in the Caribbean basin [3, 8], which is weaker in the
East than in the West and which increases over time.


3 METHODOLOGY


For this study, we use French weather sensors to collect
meteorological data. The sensors are located in different communes
of Guadeloupe: two in Grande-Terre (Le Moule and Les Abymes), four
in Basse-Terre (Sainte-Rose, Pointe-Noire, Basse-Terre, Capesterre)
and one in La Désirade. Sensors are thus present on almost all the
territory, as shown in Figure 1.

Let us now look at the data itself. First of all, the sensors did not
all start recording at the same time: the oldest ones start in 1964
and the most recent one in 1997. Each sensor measures a maximum
temperature (Tmax), a minimum temperature (Tmin) and a
precipitation value in millimeters (P). We chose to group the data in
a single file for the whole territory, and for this purpose we averaged
the measurements over the number of sensors. In addition, we
determined the average temperature by computing
$(T_{max} + T_{min})/2$ for each sensor and then averaging this value
over the number of sensors. Nevertheless, the data are partially
incomplete, but only for temperatures. In the 1990-2000 period, many
values are missing, sometimes for whole months and on several
sensors, which makes the data rather unreliable. In addition, in the
early years (1965-1975), temperature data are given by very few
sensors. That is why we preferred to set these years aside for
temperatures, so all our temperature processing starts in 1975.
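
The averaging described above can be sketched as follows. The sketch assumes that the per-sensor mean temperature is (Tmax + Tmin) / 2, which is a reading of the formula in the text, and that a sensor with a missing value is left out of the island average for that day. The readings are invented and the function names are hypothetical.

```python
from statistics import mean

# Hypothetical readings (Tmax, Tmin) in degrees Celsius for one day; None marks a
# missing measurement. Sensor names are communes from the text, values are invented.
readings = {
    "Le Moule": (30.0, 24.0),
    "Les Abymes": (31.0, 23.0),
    "Sainte-Rose": (None, 22.0),
}


def sensor_mean_temperature(tmax, tmin):
    """Mean temperature of one sensor, (Tmax + Tmin) / 2, or None if a value is missing."""
    if tmax is None or tmin is None:
        return None
    return (tmax + tmin) / 2


def island_mean_temperature(readings):
    """Average the per-sensor means over the sensors that reported both values."""
    values = [sensor_mean_temperature(*r) for r in readings.values()]
    values = [v for v in values if v is not None]
    return mean(values) if values else None


island_mean = island_mean_temperature(readings)
```

Dividing by the number of sensors that actually reported, rather than by the total number of sensors, is one possible way to handle the gaps mentioned in the text.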


The indicators studied are divided into two categories, those on 
precipitation and those on temperature. For precipitation, we have 
the cumulative monthly rainfall in millimeters and the number of 
rainy days. For temperatures, we have the maximum, minimum, 
amplitude and number of hot days. 


The usual representations for climate data are by month or year.
We have decided to proceed with a slightly different approach by
looking at these data from a seasonal perspective. In the Caribbean,
and more precisely in the West Indies, there are two very different
seasons: Lent, which is the cold and dry season, and
Wintering, which is hot and humid. We have considered that Lent
extends from December to April and that Wintering extends from
June to November. Our objective for this study is to identify
indicators of climate change and to see their influence on agriculture.
Seasonal representation has allowed us to identify more significant
trends than by month or year. These trends are discussed in the
following section.
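The seasonal grouping can be sketched as below. The month ranges come from the text; everything else is an assumption made for illustration: May is left unassigned because the text attributes it to neither season, and December is attached to the Lent season of the following year so that each dry season forms one contiguous block.

```python
LENT_MONTHS = {12, 1, 2, 3, 4}            # December to April (dry season)
WINTERING_MONTHS = {6, 7, 8, 9, 10, 11}   # June to November (rainy season)


def season_of(year, month):
    """Return (season, season_year), or None for a month outside both seasons."""
    if month in LENT_MONTHS:
        # Assumption: December belongs to the Lent season of the next year.
        season_year = year + 1 if month == 12 else year
        return ("Lent", season_year)
    if month in WINTERING_MONTHS:
        return ("Wintering", year)
    return None  # May is not assigned to a season in the text


def seasonal_totals(monthly_rain):
    """Sum monthly rainfall (mm) per season; keys of the input are (year, month)."""
    totals = {}
    for (year, month), millimeters in monthly_rain.items():
        key = season_of(year, month)
        if key is not None:
            totals[key] = totals.get(key, 0.0) + millimeters
    return totals


# Made-up monthly values, for illustration only.
sample = {(1990, 12): 80.0, (1991, 1): 60.0, (1991, 5): 100.0, (1991, 6): 150.0}
totals = seasonal_totals(sample)
```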


4 DATA ANALYSIS WORK 


The first step in our study was to ensure that several witnesses to 
climate change stand out from the data. For this purpose, we have 
extracted different indicators, based on analyses of temperature and 
precipitation data, which we explore at seasonal paradigm levels 

Although the overall analysis above provides a measure of the
intensity of the changes that have occurred over the past 50 years,
it is unfortunately insufficient to identify the impact of these
changes on agriculture. It is therefore necessary to adopt a
seasonal paradigm. Indeed, many crops on the island are
season-dependent. In this situation, adapting agricultural practices and
respecting the plant development cycle requires a deep
understanding of seasonal changes. Thus, in the second part of our study,
we focused on changes over the island's two seasons, the dry
season and the rainy season, in order to understand how climate
change affects them.

Figure 2 shows, for both seasons, how (a) the sum of precipitation
and (b) the number of rainy days have evolved since 1965.

Figure 2: For both seasons: evolution of (a) sum of precipitation
and (b) number of rainy days


The results over the years show that the general trend is towards
a decrease in rainfall. This decrease can be observed both in the
sum of precipitation and in the number of rainy days. It can also
be observed that the wet season is more affected than the dry
season, since the loss of rainfall is greater during the rainy
season, as can be seen from the slopes of the regression lines.
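Comparing slopes between seasons amounts to fitting one trend line per season. The sketch below assumes an ordinary least-squares line, which the paper does not specify; the function name and the sample series are hypothetical.

```python
def linear_trend(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up seasonal rainfall totals (mm), for illustration only.
years = [1965, 1975, 1985, 1995]
rainfall = [1000.0, 950.0, 900.0, 850.0]
slope, intercept = linear_trend(years, rainfall)
```

A negative slope indicates a loss of rainfall per year; the season with the more negative slope is the more affected one.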

Regarding the evolution of temperatures according to the season,
we focus on (a) the sum of temperatures, (b) the thermal amplitude, and
(c) the number of hot days. Figure 3 shows the results obtained.

Figure 3: For both seasons: evolution of (a) thermal accumulation,
(b) thermal amplitude and (c) number of hot days


First of all, it can be noted that over both seasons, the sum of
temperatures increases over the years (see Figure 3(a)). However,
the change is much more significant for the dry season.


IDEAS2019 - 23nd International Database Engineering & Applications Symposium 311 


IDEAS’19, June 10-12, 2019, Athens, Greece 


The results obtained for the thermal amplitude are very interest- 
ing (see Figure 3(b)). While the trend is downward for both seasons, 
the decrease is much greater for the dry season. 

If these trends continue, there may be a shift in temperature 
amplitudes between the two seasons in the coming years. We would 
therefore have lower temperature amplitudes during the dry season 
than during the rainy season. 

Finally, we focused on the number of hot days for both seasons
(see Figure 3(c)). For this last indicator, the trends are very
similar to the previous observations: an increase is recorded for
both the rainy and the dry season. This confirms that the seasons
tend to warm up.


5 CASE STUDY OF BANANA 


The agricultural sector is an important part of the economy of the
island since, according to the French Ministry of Agriculture [4], it
employs 12% of the active population. Bananas are one of the most
cultivated crops on the island of Guadeloupe, with about 7.00%.

Banana is a crop sensitive to climatic variations, which directly
affect yields. Thus, in the second phase of our data analysis work,
we sought to understand the consequences of the highlighted climatic
variations on the plant. To better understand the impact of climate
trends on banana, we studied the climate changes in relation to the
life cycle of the plant.

The development of banana is closely related to climate
conditions [11]. In our dataset, we only have data on rainfall and
temperatures over the past 50 years. However, these two components
are essential, since water and heat are favorable for growth. Some
optimum climate conditions for the plant development are
the following:


• Growth phase:

  - Temperature: 24°C < temperature < 27°C

  - Rainfall: 1300 mm < annual precipitation < 2600 mm

• Foliar phase:

  - Temperature: 26°C < temperature < 28°C


In this last part of our study, we have analyzed the data in or- 
der to track the occurrence of these optimum conditions over the 
last decades. Figure 4 shows the evolution of the appearance of 
optimum conditions for banana development according to (a) tem- 
peratures for growing, (b) annual precipitation on the last five 
decades, (c) temperatures for foliar development. 
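Tracking the occurrence of the optimum conditions can be sketched as a simple range test. The bounds are those listed above; counting days whose mean temperature falls strictly inside a range is an assumption, since the text does not state the time granularity used, and all names and sample values are hypothetical.

```python
# Optimum ranges quoted in the text (strict inequalities).
GROWTH_TEMPERATURE = (24.0, 27.0)     # degrees Celsius
FOLIAR_TEMPERATURE = (26.0, 28.0)     # degrees Celsius
ANNUAL_RAINFALL = (1300.0, 2600.0)    # millimeters per year


def in_range(value, bounds):
    """True when value lies strictly between the two bounds."""
    low, high = bounds
    return low < value < high


def count_optimal_days(daily_means, bounds):
    """Number of days whose mean temperature is inside the optimum range."""
    return sum(1 for temperature in daily_means if in_range(temperature, bounds))


# Made-up daily mean temperatures, for illustration only.
temperatures = [23.5, 24.5, 26.5, 27.5, 28.0]
growth_days = count_optimal_days(temperatures, GROWTH_TEMPERATURE)
foliar_days = count_optimal_days(temperatures, FOLIAR_TEMPERATURE)
```

Applying the same count per decade gives one way to compare how long favorable conditions last from one decade to the next.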

First of all, if we focus on the optimum temperature conditions
for the growth phase (see Figure 4(a)), the results are striking. Indeed,
it is rather unexpected to see that the periods favorable to the
development of the banana seem to shorten over the decades. Thus,
although the results show that average temperatures increase over
the years, this also results in much shorter periods of favorable
temperature conditions for banana development.

With regard to optimal precipitation conditions, very interesting
observations can also be made (see Figure 4(b)). Indeed, changes in
precipitation seem to tend towards an exit from the favorable range
in the long term.

If we focus on the optimal temperature conditions for the foliar
development phase (see Figure 4(c)), we can observe that the trend
is not the same. Indeed, the conditions favorable to the foliar
development of the banana seem to lengthen over the decades. The
results therefore show a spreading of the optimal period in the year
for the foliar development of banana.

Figure 4: Occurrence of optimum conditions for banana development
according to (a) temperatures for growing, (b) annual precipitation
and (c) temperatures for foliar development


6 CONCLUSION 


In this paper, we have used a data analysis approach to address
climate change. Unlike approaches that center on climate modelling
for providing projections, here we have conducted data analytics
work to highlight evidence of climate change extracted from field
data. The work we have carried out is centred on the French island
of Guadeloupe, located in the French West Indies, and aims to bring
out the climatic trends that occurred within the past 50 years and
their impact on agriculture.

We first gathered all rainfall and temperature data
from various sensors on the island. Then, the data were supplemented
by several climate indicators, calculated from the data, which are
expected to reflect significant changes in climate and to impact plant
development.

In the second part of the work, we carried out data mining work 
to highlight the strong climate trends that have occurred over the 
past 50 years. As seasons are known for their important role in plant 
development, we then adopted a seasonal approach to highlight 
changes that occur over the two seasons of the Guadeloupe Island. 

Finally, in the last part, we focused on banana, a crop that is
widely cultivated on the island and known to be sensitive to
climatic variations. We have studied how observed climatic trends
affect banana development, and more specifically, our approach
has made it possible to observe how optimal conditions for banana
growth and foliar development have evolved over the decades.

The work we have conducted in this paper opens various
interesting research tracks. First of all, in this paper we only concentrated
on temperature and rainfall data. However, we can reasonably suppose
that climate change may also be observed in other climate
indicators. Thus, in the short term we plan to complete our
dataset by adding other climate data to better characterise changes
taking place in the global climate and in the seasons.

Another interesting approach would be to complete the banana 
study after adding new climate indicators. Indeed, other climatic 
indicators involving humidity or sunshine are also known to have 
an impact on the development of banana trees, and plants in general. 

Finally, in a long-term perspective, extracted knowledge could 
help to adapt agricultural practices in order to have planned devel- 
opment cycles in phase with climate change. 


ABSTRACT


In Smart grid ecosystems, it is important to carefully choose the
placement of datasets across different kinds of big data systems
in order to achieve high performance of the workloads and
conformity with the business and data ecosystem. Our approach to
dataset placement is based on metadata about datasets, workloads,
and systems. This paper gives a general overview of the data
placement module, proposes a high-level design and data model for our
solution and presents the placement criteria.


1 INTRODUCTION


In their efforts to transform power grids into smart grids, utilities
rely on massive deployments of sensors distributed over all components
(meters, data concentrators and the power lines themselves) and of
smart meters at the scale of a country or even a continent.
In France, Enedis, the major power distribution system operator
(DSO) and one of the biggest in Europe, has an ongoing plan to
deploy more than 35 million smart meters by the year 2021. In
this context, more and more data are collected, providing fine-grained
insights into client consumption profiles and the behavior of the
power grid. Data are critical for more efficient management of
energy resources and provide new levels of efficiency for business
applications. These applications, such as predictive maintenance or
demand forecasting, are mainly analytics-based. They are defined
by data scientists who do not possess the knowledge needed to master
the complexity of data applications in a Smart grid ecosystem.

A Smart grid big data ecosystem, such as the one proposed by Enedis,
relies on a data lake and several project spaces dedicated to
user-specific needs. Users have different levels of skills: they may be
data specialists, IT specialists, data scientists, or non-IT
professionals. The data lake is based mainly on distributed file systems
(such as the Hadoop distributed file system, HDFS). Data lakes in modern
architectures can also include NoSQL stores. In the case of a Smart
grid, the lake stores data coming from: (i) smart meters in an
Advanced Metering Infrastructure; such data, used for metering, are
public smart meter measurements collected on the distribution grid,
quantities of electricity consumed and produced, the consumption
and production powers recorded at regular intervals, the maximum
powers reached daily and other measures such as reactive energy
or average voltage; (ii) sensors distributed over the grid; such data
are used for real-time control and monitoring; (iii) sources outside
the electric grid: data from a forecasting center station, pricing
catalogs, social media datasets referring to energy and utilities
topics, and special spatio-temporal data like geographical situations
or weather; (iv) customers/electricity distributors: technical data
such as the type of meter, the installed power, the existence of
special devices for limiting disturbances, etc., and classical
customer data.

As a consequence of the data collection from a wide variety of sources,
datasets have different formats, structures, properties, and value
distributions. Beyond storage, one of the challenges of a Smart grid
ecosystem is to be able to process and transform datasets easily and
efficiently in order to support innovative data processing
initiatives. This means that data should be extracted from the lake,
loaded into data banks and transformed according to both technical
and business requirements. To give an operational dimension to
the lake, a data bank is associated with a data bank management
system that allows its processing within a so-called project space. A
project space includes reliable and efficient workloads that execute
SQL-like queries or more basic operations on datasets of the data
banks. Workloads may rely on a data bank management system such as
a relational system like Teradata [1], a document store like
MongoDB [2], a wide-column store like Cassandra [3], a key/value
store like Riak [4], or a graph system like Neo4J [5]. Workloads
can also rely on MapReduce-based processing engines like
Apache Hive or on MPP-based processing engines like Apache
Spark [6]. However, these data bank management systems do not
offer the same guarantees regarding performance, latency, scalability,
consistency, and availability. From the point of view of data lake use,
workloads of a project space need to access, join, and process
datasets. Defining workloads requires extensive knowledge about data
bank management systems. This can be an important issue given
that not all users have such skills.

As an example of Smart grid ecosystem users, we consider data
scientists. In their daily work, those users need to access and store
datasets, experimentation results... Multi-engine systems such as our
ecosystem and its different spaces (the data lake, the data banks, and
the data labs) suffer from the data placement problem [7, 8]:
(i) where to store datasets (data migration) and where to execute the
workload (query migration)? (ii) which dataset to copy or to move, a
question addressed in some related work as which views to
materialize? (iii) in the case of the execution of a data pipeline
composed of a set of queries, how to orchestrate the workload
execution, and what is the impact on performance and scalability?
(iv) how to minimize the cost of moving a dataset and loading it into
a different store? (v) how to fragment the data or the query to
enhance performance?

Our motivation to work on data placement is to address those issues.
We aim to design a solution that optimizes the architecture of smart
grid data ecosystems in order to minimize the cost of up-front and
on-the-fly data transfer and loading between data systems, which are
generally time-consuming and redundant [9]. This paper presents our
approach for effective placement in order to ensure better processing
of datasets. A detailed state of the art and a demonstration of the
originality of our data placement design were presented in a previous
short paper [10]. Our approach is based on the properties of datasets,
workloads and the target data systems, which are modeled as metadata.
We call our approach DWS, an acronym for Datasets, Workloads, and
Systems; it is driven mostly by use cases from Enedis considering
their Big Data ecosystem called B4ALL.

The remainder of the paper is organized as follows. Section 2 gives
a global overview of our datasets placement approach. It presents the
data model for the components of the placement module: the datasets,
the workloads, and the systems. Section 3 gives an overview of the
metadata used. Section 4 presents the general design of the data
placement module and, finally, Section 5 concludes the paper and
proposes future research directions.


2 USE CASE 


The use case of our study is to set up an adapted data infrastructure
and to optimize the current architecture for the Smart grid data
ecosystem. The datasets of our experiment are a subset of Enedis's
smart grid data. The use case covers aggregated data about the
measurement of clients' electricity usage; Table 1 summarizes
statistics about it. This data is characterized by its sensitivity
and current legislation restricts research on this kind of data.
For this reason, we work exclusively on open data that are
artificially augmented: the original size was multiplied by 6
in order to simulate a Smart grid big data ecosystem. Because of the
page limit, this paper presents two example datasets and
their associated workload. Those datasets are also characterized by
their spatio-temporality, their complex data model and their high
volume. The structure of our data is relational and it generally has
a snowflake schema. It has a multidimensional data model and it is
organized as aggregates produced by analytics applications. Hence,
most of the attributes are structured or multivalued.

Table 1: Statistics about the experiment's Smart grid datasets

Dataset       Label                                       Columns nbr   Rows nbr   Data size
Conso-inf36   Aggregated clients consumption              13            1048576    14M
Conso-sup36   Aggregated industrial clients consumption   14            981101     0.9M
Family-prof   Profiles families                           2             15         16 KB

Figure 1: Entity relationship diagram for the experiment's
datasets.

Those datasets reproduce the electricity consumption, in 1/2 h steps,
of the electricity withdrawal points connected to the Enedis network
whose power is less than or equal to 36 kVA. They give the energy
volume withdrawn, the average load curves for customers with smart
meters and the number of customers. The average load curve is the
average of the volumes of electricity consumed over a 1/2 h step given
by sites equipped with communicating meters. Average curves are
collected for two different categories of consumption and then
aggregated. Enedis also provides consumers with a curve's
Representativity Index. This attribute is the ratio between the number
of points on the average curve and the total number of points of the
same customer category (same profile, same contracted power range and
same industry). The Conso-inf36 and Conso-sup36 datasets have similar
schemas (cf. Figure 1). We list below the most important columns of
those datasets:

• Horodate (H-date): a time series that records the measurement's
collection time using a half-hour step. It has a DateTime format
that represents a point in time defined precisely using the UTC
standard. It follows the pattern YYYY-MM-ddTHH:mm:ssZ.

• Profile (Profile-label): a categorical attribute that represents a
standard profile in the sense of the Recoflux. It can have one of the
following values: "PRO" for producers, "ENT3" for a particular
type of enterprise, "RES4" for residential customers...

• Contracted power range (Plag-puis): the electrical power
subscribed by the user in their supply contract. It represents the
maximum possible amount of withdrawal. In these datasets, the different
powers are grouped into power ranges. This attribute has a
complex data type.

• Number of withdrawal points (Nb-point-s): the number of sites
with an active contract on the Enedis network. It is identical for
every 1/2 hour of the same day.

• Activity sector (Sect-act): the area of the activity. This attribute
only appears in the business consumption dataset. The data type
of this attribute is complex.

• Max day of the month (Jour-max): attribute that indicates
whether the 1/2 hour considered is part of the day which reached
the peak power consumption of the month in France. The type of
this attribute is Boolean (0/1).

• Week of the month (Semaine-max): attribute that indicates
whether the 1/2 hour considered is part of the day which reached
the peak power consumption of the week in France. The type of
this attribute is Boolean (0/1).

• Total energy withdrawn (Wh) (Tot-nrj-s): the volume of
electricity consumed over 1/2 h by all the sites of the profile and
the power range considered. This attribute has a numeric data type
with double precision.

• Average curve C1 (Wh) (Cour-moy-1): average load curve of the
group of measures whose ratio (Conso 8h-20h)/(Conso total) is the
highest. Similarly, Average curve C2 (Wh) (Cour-moy-2) represents
the group with the lowest ratio (Conso 8h-20h)/(Conso total) and
Average curve C1 + C2 (Wh) (Cour-moy-1-2) is the average for all
sites.

• Curve's Representativity Index C1 (Indc-rep-cour-1):
representativity index of curve 1, expressed as a percentage. Similarly,
two other columns are defined: Curve's Representativity Index
C1 + C2 and Curve's Representativity Index C2, for the
representativity index of the other types of curves.

In a smart grid data ecosystem, analytics applications transform the
datasets using the SQL query language. Those applications have
multiple joins, compute several aggregations and contain specific
OLAP query operators. For our example, we choose a workflow that
covers join operators, aggregations and temporal operators at the
same time. The examples presented in this paper are extracted
from an application that enriches clients' consumption data with
a normalized value of the electric power.


3 OVERVIEW 


In this section, we begin by providing a high-level overview of
dataset placement in smart grid ecosystems. We then briefly
present our approach, which captures the intermediate representation
of a query workload as a graph of operators and uses the dataset
properties and the data processing system descriptions to achieve
a data placement that minimizes the cost of the workload execution.
The placement process may take place in separate phases of
a workload: just before the processing, on the input datasets, or
after the processing, on the result and intermediate result datasets.
Our data placement process takes as input a dataset S = <d1, ..., dn>
and a workload W containing query operations <op1, ..., opn> on S.
Based on the properties of S and the data operations done on S, the
placement module chooses the most adequate data system(s) for
storing S and for processing S to guarantee efficient performance
of W. A potential data placement solution can be: storage and
workload execution by a document data store like MongoDB, or
storage in the distributed file system HDFS and workload
execution by a parallel processing engine like Spark. We seek to
find the most effective solutions. In the following subsections, we
detail the data model of the inputs (the dataset S and workload W).


3.1 Datasets model


For the internal representation of the datasets in our placement
module, we adopt the ODMG [11] data model. This model proposes
two constructors to organize values into datasets: the collection
constructor, which organizes values into sets, lists, bags, arrays, and
dictionaries, and the aggregation constructor, which organizes
values into objects or aggregates. In this model, values vi are either
literals or aggregates. Those values are also categorized into atomic,
structured or collections. We represent null values and unstructured
values (e.g. binary large objects: blobs) respectively as the Null
value and an aggregated value. To build a dataset, we first use the
collection constructor. It groups related aggregates di in a collection
S:

S = <d1, ..., dn>.

The aggregate constructor creates objects that are either
structured (e.g. a tuple of a relational dataset), semi-structured (e.g. a
document) or unstructured (e.g. a blob). In our model, we keep the
same structure for all the aggregates and we represent the absence of
an attribute-value pair as a missing value. Hence, missing values
and null values are two different representations in our model, and we
associate a different literal with each. The variety of the data
manipulated in our ecosystem motivates us to consider this flexible model.
With this representation, we can process datasets managed in data 
stores having a relational data model, key-value store, document 
store or a graph based store. For instance, for a relational table, we 
represent the relations with the collection constructor and the tu- 
ples with the aggregate constructor. In document stores, aggregates 
are represented as documents and organized in collections. We rep- 
resent attribute-graphs by two datasets: nodes datasets and edges 
dataset. We use two separate collection constructors for the nodes 
dataset and the edges dataset. Aggregate constructors are also used 
to represent nodes and edges. Finally, we represent datasets stored 
in key-value stores with the collection constructor and key-value 
pairs with an aggregate (a unary aggregate). 
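The two constructors and the distinction between null and missing values can be illustrated with a small sketch. It is not the ODMG API: the names `aggregate`, `collection`, `NULL` and `MISSING` are hypothetical, and Python's `None` is used here only as the input notation for an explicit null. The attribute names are borrowed from the use case; the values are made up.

```python
class _Literal:
    """A named marker object used for the two special literals."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


NULL = _Literal("NULL")        # attribute present, value explicitly null
MISSING = _Literal("MISSING")  # attribute-value pair absent from the record


def aggregate(schema, pairs):
    """Aggregate constructor: every record keeps the same structure."""
    record = {}
    for attribute in schema:
        if attribute not in pairs:
            record[attribute] = MISSING
        elif pairs[attribute] is None:
            record[attribute] = NULL
        else:
            record[attribute] = pairs[attribute]
    return record


def collection(*aggregates):
    """Collection constructor: groups related aggregates, S = <d1, ..., dn>."""
    return list(aggregates)


SCHEMA = ("Horodate", "Profil", "Tot-nrj-s")
S = collection(
    aggregate(SCHEMA, {"Horodate": "2019-01-25T00:30:00Z", "Profil": "RES4",
                       "Tot-nrj-s": 1250.0}),
    aggregate(SCHEMA, {"Horodate": "2019-01-25T01:00:00Z", "Profil": None}),
)
```

Keeping two distinct literals lets a relational tuple (where every column exists but may be null) and a document (where a field may simply be absent) share one representation.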


3.2 Workload model 


A workload in a big data ecosystem quantifies the processing
performed on data systems. In our context, we consider query
workloads derived from SQL applications and scripts used in the
consumption aggregation use case. According to the proposed model,
query workloads are grouped as flows of operations (or activities)
and they are executed sequentially in data pipelines. Thus, we
represent a query workload as a graph G = <V, E>. V is the set of vertices,
each representing a logical operator, a variable or a value, and E is
the set of edges representing dependencies between the vertices. In the
query workload graph, we use the intermediate representation of
the query workload for the data activity structure. This
intermediate representation can be an abstract syntax tree (AST), e.g.
for MongoDB access API workloads, or a directed acyclic graph (DAG).
Representing the workload through the intermediate
representation of the queries makes the execution of the
placement algorithm independent of the environment and the target
query language. In our experiment and for simplification reasons,
we use ASTs.

Figure 2: Schema of a data pipeline in a smart grid open data
application use case.

Figure 2 presents a compact view of a data pipeline of a smart grid
data processing application. This pipeline is structured as a DAG of
data activities. Smart grid data activities generally aggregate
different datasets. They usually scale out using horizontal
partitioning and apply many filters and transformations to obtain the
desired output. As an example of a data activity, we consider the
graph in Figure 3. This activity is the first step of the pipeline.
It enriches the perimeter dataset with the date of the
calculation of the indicator and stores the result in an intermediate
table that will be used as an input for the remainder of the pipeline.

Figure 3: Example of a data activity in a smart grid data
processing application.
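A workload graph G = <V, E> of this kind can be sketched with the standard `graphlib` module (Python 3.9 or later). The vertex names below are hypothetical labels for the steps of the activity just described; they are not taken from the paper's figures.

```python
from graphlib import TopologicalSorter

# V: vertices of the workload graph (hypothetical activity names).
# E: each key depends on the vertices listed in its value.
workload = {
    "load_perimeter": set(),
    "add_calculation_date": {"load_perimeter"},
    "store_intermediate_table": {"add_calculation_date"},
}


def execution_order(graph):
    """Return the vertices in an order that respects every dependency edge."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(workload)
```

A topological order is only defined for acyclic graphs, which matches the DAG structure of the pipeline; `TopologicalSorter` raises `graphlib.CycleError` if a cycle is introduced.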


3.3 Systems model 


A data system in a big data ecosystem is either a data storage system,
such as a data store, a database or a distributed file system (DFS), or
a processing engine, e.g. an MPP-based processing engine. We model
two entities for systems: (i) an abstraction of a data system, which
details the characteristics of those systems as modeled from
state-of-the-art surveys and experimentation on those systems, and
(ii) system descriptors, which represent the data systems available in
the smart grid data ecosystem. We represent a system descriptor as a
composite object. We associate two other objects with this object: a
Storage Descriptor and a Processing Descriptor. The Storage Descriptor
is an object that defines attributes such as the data model,
partitioning model, distribution model ... Similarly, the Processing
Descriptor has as attributes: supported query and access APIs,
physical operators, data model... Descriptors are represented
according to the ODMG model as classes; the relationships between
those classes are composition, association or inheritance.
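The composite structure of a system descriptor can be sketched with plain dataclasses. The attribute names follow the ones listed in the text; the class layout, the helper methods and the sample values are assumptions made for illustration, not the authors' schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StorageDescriptor:
    data_model: str
    partitioning_model: str
    distribution_model: str


@dataclass
class ProcessingDescriptor:
    query_apis: List[str] = field(default_factory=list)
    physical_operators: List[str] = field(default_factory=list)
    data_model: str = ""


@dataclass
class SystemDescriptor:
    """Composite object: a system may offer storage, processing, or both."""
    name: str
    storage: Optional[StorageDescriptor] = None
    processing: Optional[ProcessingDescriptor] = None

    def can_store(self):
        return self.storage is not None

    def can_process(self):
        return self.processing is not None


# Illustrative descriptors; the attribute values are placeholders.
file_system = SystemDescriptor(
    "HDFS", storage=StorageDescriptor("file", "block", "replicated"))
engine = SystemDescriptor(
    "Spark", processing=ProcessingDescriptor(["SQL"], ["join", "aggregate"], "tabular"))
```

Modeling storage and processing as optional parts reflects the text's distinction between storage systems and processing engines while still allowing a single system, such as a document store, to carry both descriptors.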


4 METADATA SPECIFICATION 


The metadata we use cover three abstract, independent but
complementary layers of the ecosystem, referred to as Applications and
Workloads, Data, and Systems [10]. Each layer is described by its
specific metadata schema.

Applications/workloads metadata characterize query models,
query logical operators as well as query workloads and query
statistics. Other application metadata considered are semantic
annotations and rules.


(script de calcul du bilan de la consommation electrique, DataPipeline(
    identifier : 1000,
    application kind of : sql,
    label : « calculation of aggregates for electric consumption in smart grids »,
    workloads type : interactive,
    description : « script teradata for calculation of aggregates for electric
        consumption in smart grids »,
    application category : analytics,
    execution time : 2019-01-25,
    administration metadata : (
        keywords : [smart grid, aggregated clients consumption, aggregated
            enterprise consumption],
        author : John Smith)))


Figure 4: Examples of workloads metadata for characteriz- 
ing a Smart grid data pipeline. 


The datasets metadata (cf. datasets modeling) describe, for
instance (but not limited to): dataset size, value distribution,
availability, location, schemas, administration metadata...


( SmartGridSchemaSample, Schema ((
    _id : ObjectId("5afeecef4685520866715e73"),
    schema label : "schema ds conso inf",
    description : "new description",
    collection space : "smart grid open data",
    attributes definition : (
        "Courbe Moyenne n°1 (Wh)" : double,
        "Profil" : string,
        "Semaine max du mois (0/1)" : boolean,
        "Total énergie soutirée (Wh)" : double),
    constrains : (
        "Courbe Moyenne n°1 (Wh)" : "nullable",
        "Profil" : "not null",
        ["Horodate", "Profil", "Plage de puissance souscrite"]),
    specific_properties : (
        "unit for courbe moyenne1" : "Wh")
)))

Figure 5: Example of datasets structural metadata.


We characterize datasets with descriptive statistics about the
attributes and the records, and with structural metadata, by
keeping track of their local and global schemas. An example of a local
schema for the dataset conso-inf36 is illustrated in Figure 5. For
semi-structured datasets, we also record additional metadata such as
the probability of having missing values for an attribute, the nesting
degree of structured values and the cardinality of structured values.


(mongoDBAccessApiModule, QuerySystem(
    query support kind of : «natif»,
    query categories : («CRUD», «geospatial»),
    workload categories : {«interactive», «batch»},
    query kind of : «Api»,
    other features : («Cursor-based Queries», «JOIN-style queries»,
        «Complex Data Types manipulation», «Restrict Query Result Set Size»,
        «Key Match Options», «View», «udf1», «Data Object Expiry»)))

(wiredTiger, StorageEngine(
    security support : « none »,
    concurency support : « MVCC »,
    concurency support : « Document »,
    journal : « write-ahead log »,
    compression.compression kind of : « block »))

Figure 6: Example of systems metadata about the storage engine
of MongoDB and its query system.

The systems metadata describe the data stores, distributed file
systems (DFS) and processing engines that are part of the big data
ecosystem. The metadata schema of this level has the following
main concepts: partitioning models, data models, query models and
storage models for those systems. In our approach, we classified data
system properties. Then, according to our classification / taxonomy,
we specify additional properties as attributes (Figure 6).


5 DATA PLACEMENT MODULE DESIGN 


Let us consider a dataset S as a set of tuples «d1,..,dn» and a
query Q composed of a set of several operations «op1,..,opn» (for
instance: select/project/join ...). A data system DS is an appropriate
placement for S if the layout of the dataset is supported by its
storage engine and its processing engine supports the query Q and
provides an efficient workload execution for Q.
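The first two conditions of this definition (layout support and operation support) can be sketched as a simple predicate. The sketch below is an illustration only: the `DataSystem` class, its fields and the two example systems are hypothetical, and the third condition (efficient execution) is deliberately left out because it requires a cost model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSystem:
    """Hypothetical summary of a system's metadata (names are illustrative)."""
    name: str
    supported_layouts: frozenset     # dataset layouts the storage engine can hold
    supported_operations: frozenset  # operations the processing engine can run


def is_candidate_placement(system, dataset_layout, query_operations):
    """True when the system supports the dataset layout and every operation of Q."""
    layout_ok = dataset_layout in system.supported_layouts
    operations_ok = set(query_operations) <= system.supported_operations
    return layout_ok and operations_ok


# Two invented systems; they do not describe any real product.
document_store = DataSystem(
    "document_store", frozenset({"document"}), frozenset({"select", "project"}))
relational_store = DataSystem(
    "relational_store", frozenset({"relational", "document"}),
    frozenset({"select", "project", "join"}))

query = ["select", "join"]
candidates = [s.name for s in (document_store, relational_store)
              if is_candidate_placement(s, "document", query)]
```

Only `relational_store` survives, since the hypothetical `document_store` lacks the join operation required by the query.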
Figure 7: Workflow of a data placement module.


Our placement approach aims to address the following issues.
(i) Big data stores have APIs that are too specific to each data
model and query workload. Those systems no longer follow the
one-size-fits-all approach, and their APIs have partially overlapping
querying capabilities [9]. The choice of the storage and processing
systems therefore needs to take this aspect into consideration to
fit the applications' needs. (ii) The result of a placement can be
unacceptable in a production environment: allowing users to freely
choose the data system for their needs can lead to placement errors.
(iii) Sometimes, when cross-referencing two or more datasets, it is
more advisable to use a native (system-level) execution engine for
this type of workload than application-level hybrid processing.
Indeed, existing parallel DBMSs and multi-store engines load the
datasets in memory and then execute the query at the application
level, which adds overhead to the query performance.

We design our placement module to help users build their data
applications efficiently. Dataset placement is identified based on
systems characteristics and the functionalities offered by their APIs.
To this end, we consider three decision criteria for the placement,
explained in detail in our previous work [10]. First, we evaluate
the feasibility of the placement by comparing the characteristics
of the target systems. Second, we check the conformity of the
placement with the data and the business ecosystem using a set of
rules. Finally, we estimate the performance of query execution using
cost models that consider different datasets and query
transformations. The DWS cost model covers query execution, data
transformation and data communication between stores. This solution
helps us minimize the execution time and the data transfer time.

The mechanism behind finding the optimal placement solution is
inspired by query evaluation techniques in multi-store systems. As
shown in Figure 7, the placement process first generates a placement
solution space by inferring on metadata. This step decomposes the
workload according to the specification provided by the application
and represented as a DAG (cf. subsection 3.2). The algorithm then
decomposes the queries of the workload into a set of sub-queries and
solves the data placement problem for each sub-query combination. In
the second step, DWS's placement algorithm identifies the systems that
satisfy the feasibility of the placement as well as its compliance
with the business environment. Subsequently, the algorithm selects
effective placement candidates based on the impact of the query
workload and produces the placement schemas to be returned to
the user. The final steps of the placement consist of generating the
storage and execution configuration and then executing the placement.
In spite of the importance of these final steps, we limit our
contribution to presenting the selected placement results to the user
as a recommendation.
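The three decision criteria (feasibility, rule-based conformity, estimated cost) can be read as a filter-then-rank pipeline. The following Python sketch illustrates that reading under stated assumptions: the candidate dictionaries, the `supports_query` and `region` fields, the additive cost function and all numbers are hypothetical, and the sketch does not reproduce the DWS cost model.

```python
def recommend_placements(candidates, is_feasible, compliance_rules, top_k=3):
    """Filter candidate placements, then rank the survivors by estimated cost."""
    feasible = [c for c in candidates if is_feasible(c)]
    compliant = [c for c in feasible if all(rule(c) for rule in compliance_rules)]

    def estimated_cost(candidate):
        # Assumed additive cost: execution + transformation + communication.
        return (candidate["execution"]
                + candidate["transformation"]
                + candidate["communication"])

    return sorted(compliant, key=estimated_cost)[:top_k]


# Invented candidates with invented cost estimates (arbitrary time units).
candidates = [
    {"system": "store_a", "supports_query": True, "region": "eu",
     "execution": 4.0, "transformation": 1.0, "communication": 0.5},
    {"system": "store_b", "supports_query": True, "region": "us",
     "execution": 2.0, "transformation": 0.5, "communication": 0.5},
    {"system": "store_c", "supports_query": False, "region": "eu",
     "execution": 1.0, "transformation": 0.0, "communication": 0.0},
    {"system": "store_d", "supports_query": True, "region": "eu",
     "execution": 3.0, "transformation": 0.5, "communication": 0.5},
]

ranking = recommend_placements(
    candidates,
    is_feasible=lambda c: c["supports_query"],
    compliance_rules=[lambda c: c["region"] == "eu"],  # example business rule
)
recommended = [c["system"] for c in ranking]
```

The cheapest candidate overall (`store_c`) is rejected as infeasible and `store_b` fails the example business rule, so the recommendation lists `store_d` before `store_a`.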


6 CONCLUSION AND PERSPECTIVES 


In this paper, we presented our data placement approach, which aims
to assist users in managing the complex smart grid data ecosystem,
and we highlighted the purpose and the importance of managing
systems-level metadata. The systems-level metadata that we defined
capture characteristics of the systems' design and internal
architecture. Modeling the run-time and configuration properties of
those systems is a perspective of this work. As future work on this
project, we plan to implement our data placement approach and to
evaluate the placement module experimentally.
ABSTRACT


A fundamental problem in Social Network Analysis is how to move
from single-layer to multi-layer networks, which provide a holistic
view. User profiles resolution has received considerable attention
since it allows matching users on different online social networks
(OSNs). However, to the best of our knowledge, no study has focused
on the nesting operation for merging OSN graphs. This work is a first
step towards defining the data model and the algorithm to perform
approximate nesting of multiple OSN graphs, based on user features.
We provide initial experimental evidence based on synthetic data.

1 INTRODUCTION


In Social Network Analysis an online social network (OSN) is a
graph G = (N, E) where the nodes N represent users and the
edges E represent social connections between them, like friendship,
shared interests and working affiliation (Figure 1a). Traditionally,
OSNs are studied as separate single-layer graphs. Recently,
researchers have come to a holistic vision that includes more than
one network at a time, that is, multi-layer social networks [8].
Formally, a multi-layer network G = (N, E, L) consists of N nodes
(i.e., the users), E edges (i.e., the social connections within and
across layers) and L layers (i.e., the different OSNs), as shown in
Figure 1b.

The multi-layer perspective allows studying a growing number
of social structures and phenomena by exploiting the resultant
network, which is enriched by new edges across layers. In particular,
the red edges in Figure 1b, namely the inter-layer edges, might be
produced by a user profiles resolution algorithm identifying which
users might be different personae of the same individual. Lately, the
nested perspective has also gained popularity within the Complex
Network community [10] for its ability to represent the interaction
between different subcomponents. Nested graphs provide a possible
representation for the user profiles resolution operation on
multi-layered OSNs.
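As an illustration of the definition G = (N, E, L), a multi-layer network can be held in a small Python structure in which every node is a (layer, user) pair. This is a sketch under assumptions: the class, its method names and the example users are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class MultiLayerNetwork:
    """G = (N, E, L): nodes are (layer, user) pairs, edges connect two nodes."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)
    layers: set = field(default_factory=set)

    def add_edge(self, source, target):
        """Register both endpoints, their layers, and the directed edge."""
        for node in (source, target):
            self.nodes.add(node)
            self.layers.add(node[0])
        self.edges.add((source, target))

    def inter_layer_edges(self):
        """Edges whose endpoints lie in different layers."""
        return {(s, t) for s, t in self.edges if s[0] != t[0]}


network = MultiLayerNetwork()
network.add_edge(("osn1", "alice"), ("osn1", "bob"))    # intra-layer edge
network.add_edge(("osn2", "al1ce"), ("osn2", "carol"))  # intra-layer edge
network.add_edge(("osn1", "alice"), ("osn2", "al1ce"))  # inter-layer edge
```

The last edge plays the role of the inter-layer edges of Figure 1b: it links two profiles that a resolution step would attribute to the same individual.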


Problem Statement. The resolution of user profiles among different
OSNs allows moving from single-layer to multi-layer networks,
forming a resultant network (the bottom one in Figure 1b).
For the use case in Figure 1b, this task aims at merging (e.g.,
matching) user profiles on different OSNs, similarly to a join
operation between tables in a relational database.
The resolution of user profiles is of the utmost importance since
it allows creating the inter-layer edges across the layers. However,
because profiles that belong to the same user can have different
user ids, email addresses or nicknames, such resolution is very
challenging [4]. The current literature provides different approaches
for user profiles resolution using different types of information
[17], such as basic user features (e.g., name, user id, mail address)
[4, 14], users' activity logs (e.g., texting, sharing, reacting)
[1, 13], and the SN's topological information (e.g., friends and
mutual friends) [15].
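The analogy with a relational join can be made concrete with a naive equi-join on a single basic feature. The sketch below is only an illustration of the analogy, not one of the cited methods; the function name, the profile dictionaries and the e-mail addresses are invented.

```python
def resolve_by_feature(profiles_a, profiles_b, feature, normalise=str.lower):
    """Equi-join two profile lists on one normalised feature.

    Returns candidate inter-layer edges as (id in A, id in B) pairs.
    """
    index = {}
    for profile in profiles_b:
        value = profile.get(feature)
        if value is not None:
            index.setdefault(normalise(value), []).append(profile["id"])

    matches = []
    for profile in profiles_a:
        value = profile.get(feature)
        if value is not None:
            for other_id in index.get(normalise(value), []):
                matches.append((profile["id"], other_id))
    return matches


# Invented profiles on two hypothetical OSNs.
osn1 = [{"id": "a1", "email": "Alice@example.org"},
        {"id": "a2", "email": "bob@example.org"}]
osn2 = [{"id": "b1", "email": "alice@example.org"},
        {"id": "b2", "email": "robert@example.org"}]

edges = resolve_by_feature(osn1, osn2, "email")
```

The join recovers the pair (a1, b1) but misses a2 and b2 even if they belong to the same person, which is exactly why exact matching on basic features is fragile.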


Proposed Approach. In this paper, we propose an approximate
nesting approach for multiple social network graphs that uses the
sensor pattern noise of the images captured and shared through the
user's smartphone as the user profiles resolution technique. In
particular, we exploit a clustering approach to solve user profiles
resolution and then the graph nesting operator to approximately
nest multiple social network graphs into one single nested graph.
For the first step, we showed in [16] that the method successfully
clusters users based on the different cameras that they used for
sharing photos in different OSNs. Thus, we exploit our previous work
to merge users on different OSNs while preserving their information
content. For the second step, we represent each resulting merged node
in Figure 1b as one single node containing all the users matched by
the same inter-layer edge. Since the same user might use different
cameras, a clustering algorithm might put him/her in different
clusters, and therefore we might have distinct chains of inter-layer
edges connecting the same user of the same layer. The resulting user
resolution is therefore imprecise, and the node nesting must allow
overlapping containments, that is, the same user might appear in
different merged nodes. The aforementioned approximate nested graph
data model permits such non-exclusive (i.e., overlapping) nestings.
We assume that the clustering algorithm flattens the multi-layer
social networks into one resulting network, and that all the
resulting connected components are enriched by drawing an edge
between each photo and the cluster to which it belongs. The zoomed-in
output for the leftmost node clusters in Figure 1b is provided in
Figure 2 as a nested graph: the nested nodes containing the nodes
connected by the inter-layer edges are represented as orange squares,
while the nested edges containing the follows edges among clusters
are represented as bold edges between two squares.


Figure 1: Single-layer vs. multi-layer social networks: three
different SNs as separated layers (a); the same three SNs
forming the resultant network after the resolution of user
profiles (b).

Figure 2: A zoom-in of three clusters in the nested graph
providing the detailed description of the flattened graph in
Figure 1b.


Contribution. Compared to our previous work [7], we only need
to generalise the algorithm, because the vertex and the edge
grouping references are now separated by a distance greater than
two (i.e., five). We also propose an alternative graph nesting query
plan for this new scenario¹. The output of this new query plan
is yet another nested graph where each nested node contains
all the users whose photos belong to the same cluster, and each
nested edge connecting two nested nodes contains all the friendship
relationships among different layers relating the users in the source
cluster to those in the target cluster.

¹ FHoSP source code is available at https://rebrand.ly/FHoSP.


2 RELATED WORK


Different approaches have been proposed for user profiles
resolution [17]. Works like [4] and [14] exploit information about
users' identities, such as usernames, passwords and login information,
to match profiles across SNs. In [1] and [13], the authors collect log
files within the user's device and study usage patterns to identify
user activities on SNs and compare common behaviours across SNs.
A machine learning technique based on SN topological features is
proposed in [15]. On the other hand, smartphones have several
built-in sensors that can be used for identifying devices [3].
In particular, the sensor pattern noise (SPN) is a reliable solution
for fingerprinting smartphones due to the imperfections created
during their manufacturing process [12], and it helps to solve the
user profiles resolution problem [16].

The nesting operator [7] uses two subgraph clustering UDFs,
$g_V$ and $g_E$, which are overlapping and partial, to summarize each
subgraph into either a nested vertex or a nested edge of the final
nested graph. This operator generalizes the graph summarization and
graph join operators previously defined in the literature. With
respect to graph summarization, graph nesting is more general than
other recently proposed graph nesting operations such as [11]: the
operator presented there can only list sets of vertices inside nested
vertices rather than entire subgraphs, and it cannot nest entire
subgraphs within the final edges. While previous graph summarisation
operations mainly provide a partitioning of the graph, this operator
allows fuzzy clustering with outliers. The graph $\theta$-join [6]
ingests two graph operands and returns a single graph, where it fuses
each $\theta$-matched vertex pair from the two distinct components
into one single vertex and creates an edge between fused vertices
according to a specific "edge semantics". If we represent both
operands as distinct connected components of one single graph [6],
and we use $\theta$ to define $g_V$ and the edge semantics to define
$g_E$, the graph join operator is a specific case of the graph
nesting operator; the former, however, loses the provenance
information. Last, [7] showed that a two-hop nesting might be
expressed in other query languages (relational, document-oriented
and graph databases alike) as multiple group-by operations. Given
that none of these languages allows performing multiple group-by
operations simultaneously, their associated query plans prove to be
inefficient.


3 LOGICAL AND PHYSICAL DATA MODELS 


Given that multi-layer networks might be represented as graph
collections, and given that both a graph database and a graph
collection might be represented as distinct connected components of
a single graph [6], we choose to represent a multi-layer network as
one nested graph. This assumption also helps us represent the
photos' clustering information as nodes that belong to no layer and
are shared among the different social networks.

The distinction between the logical (nested graphs) and the physical
data model is required for distinguishing the several roles that the
data structures play. First, we represent nested graph operands after
the loading and indexing phase as an extended adjacency list where
each vertex v is associated with its id, its hash, its label-set and
its attributes (i.e., properties) with their associated values. The
graph is initially created in primary memory without the offset
information (loading) and afterwards serialised into secondary memory
using a specific vertex order detected in the previous phase
(indexing).
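The two phases can be pictured with a small Python sketch of an extended adjacency list. It is an illustration under assumptions, not the paper's physical format: the `Vertex` class, the JSON-lines encoding and the `serialise` helper are hypothetical, and only the idea of "build in memory first, then write in a chosen vertex order while recording offsets" is taken from the text.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """One entry of the extended adjacency list (field names are illustrative)."""
    id: int
    hash: int
    labels: frozenset
    properties: dict = field(default_factory=dict)
    out_edges: list = field(default_factory=list)  # (edge id, edge label, target id)


def serialise(vertices, order_key):
    """Write vertices in the chosen order; return the bytes and each vertex offset."""
    buffer, offsets = bytearray(), {}
    for v in sorted(vertices, key=order_key):
        offsets[v.id] = len(buffer)  # the offset is known only at this stage
        record = {"id": v.id, "hash": v.hash, "labels": sorted(v.labels),
                  "properties": v.properties, "out": v.out_edges}
        buffer += (json.dumps(record) + "\n").encode("utf-8")
    return bytes(buffer), offsets


# Loading: vertices exist in memory without offsets.
vertices = [
    Vertex(id=2, hash=7, labels=frozenset({"USER"}),
           properties={"name": "alice"}, out_edges=[(10, "hasPhoto", 1)]),
    Vertex(id=1, hash=3, labels=frozenset({"PHOTO"})),
]

# Indexing: pick a vertex order (here by id) and serialise accordingly.
data, offsets = serialise(vertices, order_key=lambda v: v.id)
```

Changing `order_key` (for example to `lambda v: v.hash`) changes the serialisation order without touching the in-memory representation.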

Second, the nested query result is only used by the user to read the
outcome of the nesting process, as in other query languages (such
as SPARQL and SQL), and does not have to produce "materialised
views". Therefore, the result of the graph query itself can postpone
the creation of a complete "materialised view". Such a query result
represents the adjacency list associated with the resulting nested
graph, along with a nesting index.


Figure 3: Graph schema of the flattened social networks
(thick edges; edge labels include follows and hasPhoto). The user
will formulate both the vertex (dashed) and the edge (dash-dotted)
summarization patterns.


4 FHOSP ALGORITHM 


For representing Figure 1b as a nested graph, the user might use the
clustering outcome to nest the networks. The nesting operator
requires two distinct graph patterns, one for nesting the
graphs inside the nested vertices (vertex summarization) and the
other for nesting the graphs inside the nested edges (edge
summarization). These can also be derived from the graph schema
(Figure 3); the patterns express the following information need:
"After nesting each USER into each photo cluster ID for each posted
PHOTO, establish an edge between two photo cluster IDs if and only
if there are two USERs, one following the other, that have photos
belonging to those clusters, and nest the original following
information within this nested edge". Instead of traversing all the
patterns and then joining them together, we might first visit the
PHOTOs to associate each user with all its potential cluster IDs,
which represent the final nested vertices. Then, we visit all the
follows edges to generate the nested edges connecting the cluster IDs.
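The two-pass idea (photos first, follows edges second) can be sketched in a few lines of Python. This is a simplified in-memory illustration under assumptions, not the FHoSP algorithm itself: the edge-list inputs, the function name `nest` and the example data are hypothetical, and no ordering, postponement or nesting index is modelled.

```python
from collections import defaultdict
from itertools import product


def nest(has_photo, in_cluster, follows):
    """Build nested vertices (cluster -> users) and nested edges
    ((cluster, cluster) -> follows edges) from three edge lists."""
    photo_clusters = defaultdict(set)
    for photo, cluster in in_cluster:
        photo_clusters[photo].add(cluster)

    # Pass 1: visit the photos to associate each user with its clusters.
    user_clusters = defaultdict(set)
    nested_vertices = defaultdict(set)
    for user, photo in has_photo:
        for cluster in photo_clusters[photo]:
            user_clusters[user].add(cluster)
            nested_vertices[cluster].add(user)

    # Pass 2: visit the follows edges and nest each one inside every
    # (source cluster, target cluster) pair it connects.
    nested_edges = defaultdict(set)
    for u, v in follows:
        for c_i, c_j in product(user_clusters[u], user_clusters[v]):
            nested_edges[(c_i, c_j)].add((u, v))

    return dict(nested_vertices), dict(nested_edges)


# Invented example: alice used two cameras, so she falls into two clusters.
has_photo = [("alice", "p1"), ("alice", "p2"), ("bob", "p3")]
in_cluster = [("p1", "c1"), ("p2", "c2"), ("p3", "c2")]
follows = [("alice", "bob")]

nested_vertices, nested_edges = nest(has_photo, in_cluster, follows)
```

The overlapping containment discussed earlier shows up directly: `alice` appears in both nested vertices `c1` and `c2`, and the single follows edge is nested inside two nested edges.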


4.1 Loading and Indexing 


Given that each user is identified by the set of the associated photo
cluster descriptors, and given that we want to return a nested graph
where (i) each nested vertex contains all the users having photos
associated with the same cluster id and (ii) each nested edge between
clusters $c_i$ and $c_j$ contains all the following relationships
associating users from cluster $c_i$ to the ones in cluster $c_j$, we
want to visit the graph such that we first recognize all the photos
and the users belonging to cluster $c_i$, and only afterwards visit
all the photos and users belonging to cluster $c_j$. This visit
strategy minimises the graph visits required to generate the edge
$c_i \to c_j$ and all of its nestings. The problem reduces to finding
all the possible dependencies within one single multi-layered
flattened operand and visiting the graph's vertices in increasing
order of mutual dependencies.

In order to validate this assumption, we define three distinct
ordering strategies influencing the graph visiting order: the id
order $\iota$, the hashing order, and the topological order $\tau$.
The first two serialise the operand without taking into account the
order of the mutual dependencies, while the last one follows the
previous assumption. $\iota$ is a loading and indexing strategy that
orders the vertices by their ids and serialises the operand's
adjacency list accordingly. Given that in our merged layer dataset
the vertices' ids are randomly assigned for each flattened
multi-layered graph g, visiting g in this order does not necessarily
guarantee an optimal ordering. This strategy provides the baseline
for the loading and indexing strategy that meets the requirements
stated in the previous paragraph. The hashing strategy is the
same adopted in [7]: we serialise the vertices ordered by their label
information and, in particular, we serialise first the cluster ids,
then the photos and, last, the users. No specific order is preferred
among the nodes having the same label information. Finally, the
topological strategy $\tau$ requires a topological ordering of the
operand itself. Nevertheless, such an ordering requires the operand
to be a Directed Acyclic Graph (DAG), while the OSNs' follows
relationships in g cannot generally guarantee this condition.
Therefore, we need to detect a feedback arc set [19] to break those
cycles in the operand g, provide the topological ordering for each
resulting DAG, and then serialise the two DAGs as one single graph.
Given that finding the minimal feedback arc set is an NP-complete
problem, we use the polynomial heuristic defined in [18] to
approximate our problem: we generate two DAGs $g_1$ and $g_2$
containing the edges (u, v) where u < v, and those where u > v,
respectively. Last, the vertices are serialised using the topological
order of $g_1$ and then $g_2$. As a result, all the indexing costs
are linear with respect to the data size and, in particular, the
topological loading and indexing requires an additional linear time
to split the operand into two DAGs, thus resulting in the least
efficient indexing and loading strategy.
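The edge-splitting heuristic can be illustrated with Python's standard `graphlib` module. The sketch below is an assumption-laden illustration rather than the implementation referenced in [18]: the function names are hypothetical, vertex ids are assumed to be comparable integers, self-loops are silently dropped, and the two topological orders are returned separately because the text does not detail how they are merged during serialisation.

```python
from graphlib import TopologicalSorter


def split_into_dags(edges):
    """Split directed edges by comparing endpoint ids.

    Along any path of `forward` the ids strictly increase (and strictly
    decrease along `backward`), so neither half can contain a cycle.
    """
    forward = [(u, v) for u, v in edges if u < v]
    backward = [(u, v) for u, v in edges if u > v]
    return forward, backward


def topological_order(vertices, dag_edges):
    """Return the vertices so that u precedes v for every edge (u, v)."""
    sorter = TopologicalSorter({v: set() for v in vertices})
    for u, v in dag_edges:
        sorter.add(v, u)  # declare u as a predecessor of v
    return list(sorter.static_order())


def dag_orders(vertices, edges):
    forward, backward = split_into_dags(edges)
    return (topological_order(vertices, forward),
            topological_order(vertices, backward))


# A follows cycle 1 -> 2 -> 3 -> 1, which has no topological order as a whole.
vertices = [1, 2, 3]
edges = [(1, 2), (2, 3), (3, 1)]
order_g1, order_g2 = dag_orders(vertices, edges)
```

The cycle is broken by sending the edge (3, 1) to the second DAG, so both halves admit a topological order even though the original graph does not.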


4.2 FHoSP Nesting


After loading and indexing each operand, we can now run the
nesting algorithm. Please note that, due to the lack of space, we
only describe the algorithm specific to the present paper's use case.

We iterate over each single node appearing in the graph (Line
6): if the node is a PHOTO p (Line 8), we extract all the users
u associated with that photo and all the possible clusters $c_i$ into
which u might fall. As a result, $c_i$ will be one of the resulting
nested graph's nested vertices (Line 14), which will contain u. We
can now start to write the nesting index associating cluster $c_i$ to
user u and save the same information in primary memory (Line 13).
This node visit has a computational complexity of
$O(1) + C_p + |in(p)| \cdot C_p$ for each photo p, where $C_p$ is the
number of clusters associated with the photo p; indexing does not
change the computational complexity. We denote such computational
complexity as $T$.

If the node u is a USER for which we have not yet visited all the
associated PHOTOs (Line 20 and 28) or those of one of her/his
followers, then we postpone the analysis until we have all the
associated cluster information (Line 28 and 35); otherwise, we
establish a new nested edge e between the two followers' clusters
(Line 31) and nest the original follows edge inside it (Line 32). We
outline two completely opposite scenarios: a) the worst case
scenario, when traversing a non-sorted graph, and b) the traversal of
a topologically sorted graph. a) We visit the peripheral users before
the hub nodes in the community and, for each peripheral user, we
first visit all the follows edges and then the hasPhoto ones. In this
case the computational complexity for each user u is:

$$\sum_{v \in out_{follows}(u)} |out_{hasPhoto}(v)| + |out_{hasPhoto}(u)| + |out_{follows}(u)|$$

and the number of keys of usersToVisit is the number of non-peripheral
nodes, $H$. b) If all the photos are visited before the users, then
the computational complexity for each user u is:

$$\sum_{v \in out_{follows}(u)} \left( |out_{hasPhoto}(v)| + C_u \cdot C_v \right) + |out_{hasPhoto}(u)|$$

where $C_u$ is the number of clusters to which user u is associated
via the photos, and no element is inserted in the map usersToVisit.
Let us denote by $b_f$ ($b_p$) the user-to-follower (user-to-photos)
branching factor, by $k$ the average cluster size and by $U$ the
number of users: a) approximates to $U(b_f b_p + b_p + b_f)$ and b) to
$U(b_f b_p + b_p + b_f k^2)$.


Algorithm 1 Five HOp Separated Patterns Algorithm (FHoSP)

 1: procedure FHoSP(g)
 2:   visited := ∅                                          ▷ Bitmap
 3:   usersToVisit := []                                    ▷ HashMap int ↦ unsorted set
 4:   clUsersMap := []                                      ▷ HashMap int ↦ unsorted set
 5:   NestingIdx := open(File); NestedGraph := (V, E)
 6:   for each vertex u ∈ V_g do
 7:     switch u.labels do                                  ▷ via u.hash
 8:       case {PHOTO}:
 9:         visited.add(u)                                  ▷ O(1)
10:         clusters := {v | (u, v) ∈ out(u) ∧ v.labels = {CLUSTERID}}   ▷ C_u := |out(u)|
11:         for each edge (v, u) ∈ in(u) s.t. v.labels = {USER} do       ▷ |in(u)|
12:           for each c_i ∈ clusters do
13:             clUsersMap[c_i].add(v); NestingIdx.write((c_i, v))
14:             V.add(c_i)
15:       case {USER}:
16:         skip := false; friends := ∅
17:         for each edge (u, v) ∈ out(u) with id e' do     ▷ |out(u)|
18:           switch e'.labels do
19:             case {hasPhoto}:
20:               if v ∉ visited then skip := true          ▷ O(1)
21:             case {follows}:
22:               if not skip then
23:                 noSkip := true
24:                 for each edge (v, w) ∈ out(v) with id e do   ▷ |out(v)|, v ∈ out(u)
25:                   if e.labels = {hasPhoto} then
26:                     if w ∉ visited then                 ▷ O(1)
27:                       noSkip := false
28:                       usersToVisit[v].add(w, e)         ▷ O(1)
29:                 if noSkip then
30:                   for (c_i, c_j) ∈ clUsersMap[u, v] do [a]   ▷ C_u · C_v
31:                     e := c_i ⊕ c_j; E.add(e = (c_i, c_j))
32:                     NestingIdx.write((e, e'))
                friends.add(v, e')
33:         if skip then
34:           for (v, e) ∈ friends do                       ▷ |out(u)|
35:             usersToVisit[v].add(u, e)                   ▷ O(1)
36:   for each v ∈ key(usersToVisit) do
37:     for (u, e') ∈ usersToVisit[v] do                    ▷ |in_follows(v)|
38:       for (c_i, c_j) ∈ clUsersMap[u, v] do              ▷ C_u · C_v
39:         e := c_i ⊕ c_j; E.add(e = (c_i, c_j))
40:         NestingIdx.write((e, e'))
    return (NestedGraph, NestingIdx)

[a] clUsersMap[u, v] is just a shorthand for clUsersMap[u] × clUsersMap[v]

The postponed creation of the remaining nested edges is performed
at the end of the vertex iteration, once all the graph data
information has been collected (Line 39). In the worst case scenario,
that is, when the analysis of the majority of the nodes is postponed,
the computational complexity is

$$\sum_{v \in key(usersToVisit)} C_v \cdot \sum_{u \in in_{follows}(v)} C_u$$

Using the shorthands introduced in the former paragraphs, and
considering that this scenario is triggered for a) and never for b),
this reduces to $k^2 b_f H$. Last, the nested graph is serialised in
secondary memory: this part is omitted in the pseudo-code.
The computational complexity of this part is linear with respect to
the size of the output nested graph, $O$.
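The two traversal scenarios can be compared numerically. The sketch below assumes the approximations are read as $U(b_f b_p + b_p + b_f) + k^2 b_f H$ for the unsorted scenario a) and $U(b_f b_p + b_p + b_f k^2)$ for the sorted scenario b); the function names and the sample values of U, H, b_f, b_p and k are hypothetical and chosen only for illustration.

```python
def traversal_costs(U, H, b_f, b_p, k):
    """Return (shared, unsorted_extra, sorted_extra) cost terms.

    `shared` is paid in both scenarios; the other two are the terms
    that differ between the unsorted (a) and sorted (b) traversals.
    """
    shared = U * (b_f * b_p + b_p)
    unsorted_extra = U * b_f + k**2 * b_f * H  # direct visit + postponed edges
    sorted_extra = U * b_f * k**2              # nothing is postponed
    return shared, unsorted_extra, sorted_extra


def topological_order_wins(U, H, k):
    """Condition obtained after dividing both extra costs by b_f."""
    return k**2 * (U - H) < U


# Invented sizes: 1000 users, 100 non-peripheral nodes.
_, a_small, b_small = traversal_costs(U=1000, H=100, b_f=10, b_p=5, k=1)
_, a_large, b_large = traversal_costs(U=1000, H=100, b_f=10, b_p=5, k=2)
```

With clusters of a single user (k = 1) the sorted traversal is cheaper (10000 against 11000), whereas with k = 2 it becomes more expensive (40000 against 14000), which matches the sign of `topological_order_wins`.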

We can finally ask ourselves when the topological sort appears
to be the best solution for traversing the graph: if we ignore the
cost $T + O + U(b_f b_p + b_p)$, which is shared between the two
scenarios, then the question reduces to asking when
$k^2 b_f U < U b_f + k^2 b_f H$. Dividing both sides by $b_f$ gives
$k^2 (U - H) < U$. We observe that this happens when we are able to
guarantee an almost perfect clustering, where clusters contain on
average at most $\sqrt{U/(U-H)}$ users.


Table 1: The single operand sizes (left) and each graph g used for
the benchmark (right).

Sampled | # Vertices | # Edges      Operands      | # Vertices | # Edges
Layer1  | 37         | 46           Layer1        | 37         | 46
Layer2  | 90         | 126          Layer1+4      | 60         | 76
Layer3  | 88         | 130          Layer1+2+4    | 127        | 202
Layer4  | 32         | 30           Layer1+2+3+4  | 199        | 332


5 EXPERIMENTAL RESULTS


Our preliminary experiments for big data graphs (up to 100 million
nodes) show that our approach outperforms the graph nesting
implementation over PostgreSQL by at most one order of magnitude;
PostgreSQL had already proved to be the best competitor in our
previous graph nesting implementation [7], with which we share the
same experimental assumptions, while Virtuoso and Neo4J provided an
overall worse performance than PostgreSQL. As in that case, we
consider the time to (i) serialize our data structure (Loading) and
(ii) evaluate the query plan (Indexing and FHoSP). For our
evaluations, we generate each layer by randomly sampling the
Friendster social network graph [20] for 10 users using different
seeds, and enriching it with the post (i.e., photo) and tag (i.e.,
cluster) distribution provided by the LDBC Benchmark [9] and
implemented in [2]. Given that this dataset was very small, changes
in the topological distribution did not significantly affect the
computation time of the algorithm, which was dominated by the number
of vertices. We refer to [7] for some nesting examples where the
previous THoSP algorithm was run on both real and synthetic data with
different distributions. We kept the analysis to small social
networks given that our previous work on user profiling only focused
on 10 different devices [16]. For the first and the fourth layer
there are as many clusters as users, while for the two remaining
layers we assumed that each user might use different devices, and
therefore might be part of different photo clusters; the resulting
graph layers² are described in Table 1 (Sampled). Similarly, we
represent the multi-layered graph as one single edge table in
PostgreSQL. Therefore, we provided four possible combinations of the
layered networks (Table 1, Operands), on top of which we run the
FHoSP algorithm and the SQL queries.

Table 2 provides a comparison between FHoSP over the three
loading and indexing strategies and PostgreSQL: for our competitor,
indexing happens during the query evaluation, and therefore we
compare the sum of our indexing and nesting time to PostgreSQL's
query evaluation. Please note that the loading time in PostgreSQL
corresponds to loading the edge tables for all the layers within
one single table. We observe that, for small datasets, the loading
time using the topological ordering ($\tau$) is nearly comparable to
the cost of loading the graph without ordering it in primary
memory ($\iota$) or ordering the vertices by hash value. Nevertheless,
FHoSP loading time is comparable to PostgreSQL's. Finally, we
observe that a specific indexing strategy does not provide
significant advantages for small datasets, even though all the
proposed solutions outperform PostgreSQL's associated query plan by
about one order of magnitude. Further tests should also be carried
out on bigger datasets in order to strongly substantiate this
preliminary experimental evidence.


Table 2: Loading time and Indexing + Nesting time (ms) of the nesting
algorithm for the present use case.

Loading       | FHoSP ($\iota$) | FHoSP (hash) | FHoSP ($\tau$) | PostgreSQL
Layer1        | 0.228           | 0.386        | 0.231          | 3.133
Layer1+4      | 0.677           | 0.334        | 0.322          | 3.323
Layer1+2+4    | 0.686           | 0.810        | 0.705          | 3.833
Layer1+2+3+4  | 1.951           | 1.145        | 1.208          | 4.013

Indexing + Nesting | FHoSP ($\iota$) | FHoSP (hash) | FHoSP ($\tau$) | PostgreSQL
Layer1             | 0.291           | 0.394        | 0.343          | 8.337
Layer1+4           | 0.415           | 0.440        | 0.485          | 8.557
Layer1+2+4         | 1.274           | 1.305        | 1.153          | 10.218
Layer1+2+3+4       | 1.944           | 1.736        | 1.695          | 20.125

² The dataset is available at https://rebrand.ly/fhospdata.
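The "one order of magnitude" statement can be checked directly against the Indexing + Nesting figures of Table 2. The short Python computation below only divides the reported PostgreSQL times by the reported FHoSP times; the dictionary keys (`iota`, `hash`, `tau`) are labels chosen here for the three strategy columns, assuming they appear in that order in the table.

```python
# Indexing + Nesting times in ms, copied from Table 2.
indexing_and_nesting_ms = {
    "Layer1":       {"iota": 0.291, "hash": 0.394, "tau": 0.343, "postgresql": 8.337},
    "Layer1+4":     {"iota": 0.415, "hash": 0.440, "tau": 0.485, "postgresql": 8.557},
    "Layer1+2+4":   {"iota": 1.274, "hash": 1.305, "tau": 1.153, "postgresql": 10.218},
    "Layer1+2+3+4": {"iota": 1.944, "hash": 1.736, "tau": 1.695, "postgresql": 20.125},
}


def speedups(table):
    """PostgreSQL time divided by the FHoSP time, per operand and strategy."""
    return {
        operand: {strategy: row["postgresql"] / time
                  for strategy, time in row.items() if strategy != "postgresql"}
        for operand, row in table.items()
    }


ratios = speedups(indexing_and_nesting_ms)
all_ratios = [r for row in ratios.values() for r in row.values()]
lowest, highest = min(all_ratios), max(all_ratios)
```

On these figures the speed-up ranges from a little under 8 times (Layer1+2+4, hash ordering) to a little under 29 times (Layer1, id ordering), so "about one order of magnitude" is a fair summary, while a strict "always ten times" would not hold for Layer1+2+4.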


6 DISCUSSION 


In this paper we did not discuss or evaluate the quality of the
final nesting outcome: we already discussed the quality of the
clustering approach in [16], and the present algorithm takes the
output of that previous phase as its input. Therefore, the present
paper only focuses on the computational complexity of the
multi-network nesting. Our previous results in [16] suggested that the
sensor pattern noise is a reliable characteristic for solving the user
profiles resolution problem. As a result, we present a preliminary
analysis comparing an ad-hoc nested graph implementation to current
SQL query plans (i.e., PostgreSQL's). Although the dataset of choice
was small, the preliminary study conducted in the present paper
suggests that the proposed approach is promising: while our
serialization algorithm is comparable to PostgreSQL's, our proposed
algorithm (FHoSP) is roughly one order of magnitude faster than
PostgreSQL's query plan (between about 8 and 29 times in Table 2).
To simplify the current problem, we assumed that all
the OSNs had the same schema and that the relationships among
users, photos, and cluster_id are always labelled in the same way
and similarly represented across the different OSNs. On the
other hand, different OSNs may provide the same information using
a different representation, thus resulting in a different schema
representation. In this case, we might use the Q operator [5] for
nesting multiple graphs after aligning the different OSNs' schemas,
as suggested in the current literature.


7 CONCLUSIONS 


Social networks have always fascinated researchers who are
interested in various problems, like information dissemination,
missing data and visualisation. Recently, there has been a growing
interest in multi-layer networks, where the individual networks are
mutually connected. In this paper, we presented a preliminary study
on the approximate nesting of multiple OSN graphs using the
sensor pattern noise of the shared images. The clustering approach
classifies the users according to the different smartphones used to
capture and share the images, resulting in multiple inter-layer edges
across layers. In our future work, we will also take into account
multiple possible user descriptors to target user profiles resolution,
and the possibility of integrating the clustering phase directly
into the graph nesting query plan. As a result, we would need to
introduce approximate graph matching for covering multiple
descriptions and to generalise the currently provided nesting
algorithm to any given graph nesting task expressible by Q. The
results are promising and suggest that further research should be
conducted to evaluate our approach on bigger, real datasets.
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ABSTRACT 


We live in the age of big data and distributed computing. The cur- 
rent large scale computation frameworks are based on a scaling-out 
approach for distributing tasks over a cluster of commodity ma- 
chines. Apache Spark is one of these frameworks that has excelled 
in many computational tasks. Implementation of statistical learning 
algorithms over Spark is a challenging task. A bad implementation 
may lead to a significant decrease in performance and a waste of 
cluster time and money. Poor performance is mostly due to a lack 
of understanding of the data at hand and of Spark’s underlying 
mechanisms rather than to a deficit in the framework itself. In 
this paper, we consider the use case of χ² feature selection, which 
is very popular in supervised learning pipelines. Our implementation 
follows the algorithm of the Scikit-learn Python machine learning 
library, which is different from the algorithm used by the Spark 
machine learning library. The Spark ML library implementation 
of χ² feature selection accepts only categorical features. Our 
alternative implementation is more suitable for numerical features. 
We experiment in particular with features of high sparsity such 
as n-gram counts. We study the best partitioning scheme of the 
data and the optimal number of partitions. Our experiments are 
run over the Databricks platform. 


1 INTRODUCTION 


We live in the age of big data and distributed computing. The current 
large scale computation frameworks are based on a scaling-out ap- 
proach for distributing tasks over a cluster of commodity machines. 
These machines are relatively slow with cheap interconnects and 
abundant failures. Scaling-out has gained momentum in the face of 
the huge cost associated with scaling-up systems. Map-Reduce is the 
programming paradigm of scaling-out systems such as Apache 
Hadoop and Apache Spark. Spark has extended the Hadoop framework 
by improving performance and adding more utilities such as 
the ability to work with dataframes in a SQL-like manner. Feature 
selection is a very important building block in any machine learning 
pipeline. Its aim is to remove non-informative features and select 
the ones that are most useful for prediction. Feature selection helps 
reduce the training time, improve accuracy and avoid overfitting 
the trained model. Designing the best distributed implementation 
of feature selection is a challenging task. In this paper, we present 
an alternative implementation of χ² feature selection over Apache 
Spark, based on the algorithm used in Scikit-learn, and evaluate 
it over the Databricks platform. χ² feature selection is based on 
hypothesis testing which is a powerful tool in statistics. Hypothesis 
testing determines whether a result is statistically significant, or in 
other words, whether it occurred by chance or not. 

The Spark machine learning library implements χ² feature selection 
only for categorical data, based on building the standard 
χ² contingency table for each feature. In contrast, the scikit-learn 
Python library implementation of χ² accepts numerical features 
such as term counts in document classification. Scikit-learn does 
not build the complete contingency table. Rather, it uses a simplified 
χ² formula to measure the dependence between a feature and the 
label (or class), treated as two stochastic variables. Even though this 
may not exactly reflect the standard theoretical framework of the 
algorithm, it is widely considered very useful in practice. Scikit-learn 
is, however, designed to work on a single machine. We propose 
an efficient and distributed implementation of the same algorithm 
over Apache Spark. We experiment with datasets of different sizes 
and sparsity. In particular we study the best partitioning scheme of 
the data and the optimal number of partitions. 

The remainder of this paper is organized as follows. In section 2 
we review the Spark framework and data types. χ² feature selection 
is detailed in section 3. We discuss current implementations and 
present our alternatives in section 4. Experimentation and results 
are addressed in section 5. Related work is reviewed in section 6. 
Finally, section 7 concludes the paper and sheds light on future work. 


2 BACKGROUND ON APACHE SPARK 


Apache Spark [15] is a unified engine that makes big data process- 
ing tractable through parallel computation, in-memory process- 
ing and on the fly optimization. It is a fault-tolerant and general 
purpose cluster computing system providing APIs in Java, Scala, 
Python and R. Spark can run workloads on large scale clusters up to 
100 times faster than its Hadoop predecessor. It was originally designed 
as a more general computing engine to specifically address three 
limitations in previous map-reduce frameworks: (1) the capability to 
deal with general execution graphs and iterative algorithms that 
make many passes through the same data, (2) real-time streaming: 
computing tasks incrementally as new data arrives, and (3) interactive 
queries, which make it accessible through a notebook-like 
interface. Spark developers chose to keep a small core while 
contributing additional features through libraries such as MLlib, 
Spark SQL, Streaming and GraphX. The data sharing abstraction of 
Spark is called "Resilient Distributed Dataset", or RDD. An RDD is a 
distributed collection of JVM objects that are strongly typed and 
support functional operators such as map, filter, reduceByKey, 
flatMap, etc. Pair RDD refers to an RDD of key/value pairs. RDDs 
are however considered unstructured since the internal structure 
of the RDD objects is unknown. Dataset is a newer interface that 
provides the benefits of RDDs along with the benefits of Spark SQL's 
optimized execution engine. A Dataframe is a Dataset of rows organized 
into named columns. A Dataframe is conceptually equivalent 
to a table in a relational database and supports operations such 
as show, select, agg, groupBy, join, etc. For more details 
about these three APIs (RDD, Dataset and Dataframe), we refer the 
reader to [5]. Spark can perform any parallel computation based on 
its map-reduce general paradigm. The map operation allows local 
computation tasks by transferring the code to the nodes hosting 
the data. The reduce operation allows all-to-all communication 
which can emulate any message exchange, even though sometimes 
inefficiently. Yet a series of smart optimizations made Spark's 
performance comparable to many specialized distributed computation 
systems. For instance, Spark achieves fault tolerance efficiently by 
using lineage graphs and avoids storing data that can be recomputed. 
Lost partitions are recomputed from lineage graphs in case of a 
failure. 

Programming under Spark might be seen as less flexible than 
under other distributed computation frameworks such as message 
passing ones. Still, some optimization techniques are available. For 
instance the developer can customize the partitioning of the data 
among the nodes. A smart partitioning may bring substantial 
performance gains in the face of shuffles. Shuffling is moving data from one 
node to another to be grouped with its key. Shuffling is required by 
some operations such as reduce or groupByKey. By default, Spark 
uses hash partitioning which attempts to spread the data evenly 
across partitions based on the key hash. Other options to control the 
partitioning of pair RDDs are to set the number of partitions, use a 
range partitioner or create a new customized partition scheme. 
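
The difference between hash and range partitioning can be sketched in a few lines of plain Python. This is only an illustration of the two assignment rules, not Spark's partitioner API: the function names, the record layout and the boundaries are assumptions made for the example, and Spark itself works on JVM hash codes.

```python
from bisect import bisect_left
from collections import Counter


def hash_partition(key, num_partitions):
    """Assign a key to a partition from its hash (hash partitioning rule)."""
    return hash(key) % num_partitions


def range_partition(key, boundaries):
    """Assign a key to a partition from sorted upper boundaries (range rule)."""
    return bisect_left(boundaries, key)


# Hypothetical pair data: (feature index, value) records keyed by feature index.
records = [(k, 1) for k in range(100)]

# Number of records that land in each of the 4 partitions under each rule.
hash_sizes = Counter(hash_partition(k, 4) for k, _ in records)
range_sizes = Counter(range_partition(k, [24, 49, 74]) for k, _ in records)
```

With evenly spread integer keys both rules give balanced partitions. Range partitioning additionally keeps neighbouring keys together, which matters when later operations group by key ranges.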


3 χ² FEATURE SELECTION 


The χ² statistics [14] are commonly used to rank binary, discrete 
and nominal features. The score of a feature is a measure of how 
much the expected count E and the observed count O deviate from 
each other. The expected sum assumes the independence of any of 
the categories of the feature f from the class label c: 


$$E(f = v_i, c) = S(f = v_i)\,\frac{\mathrm{count}(c)}{\mathrm{total\ count}} = S(f = v_i)\,P(c)$$


where S(f = v_i) is the number of occurrences of the category v_i 
among all the data instances, count(c) is the count of data instances 
belonging to the class c, total count is the total number of instances 
in the dataset, and P(c) is the likelihood that a randomly drawn 
data instance would belong to c. 

The observed sum O(f = v_i, c) is simply the count of v_i 
occurrences over all the data instances belonging to c in the dataset. 
Assuming that the dataset has N data instances and k different classes, 
the overall χ² statistic for a feature f with m different categories is 
given by: 


$$\chi^2(f) = \sum_{i=1}^{m} \sum_{c=1}^{k} \frac{\left(O(f = v_i, c) - E(f = v_i, c)\right)^2}{E(f = v_i, c)}$$


The same formula is used by the Spark ML implementation [4]. 

χ²(f) is used to test the independence of f and the class label 
c. The confidence of rejecting the independence hypothesis gets 
higher for higher values of χ²(f), indicating that the feature f 
is very likely to be correlated with the class label c. Therefore 
a straightforward selection procedure is to sort the features in 
descending order of their χ² statistics and select the top ones. 
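
The contingency-table statistic and the ranking step can be sketched in plain Python as follows. This is a minimal illustration, not the Spark ML code: the function names and the three small tables are hypothetical, and every expected count is assumed to be non-zero.

```python
def contingency_chi2(table):
    """Pearson chi-squared statistic of an m x k table of observed counts."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]          # S(f = v_i)
    col_sums = [sum(col) for col in zip(*table)]    # count(c)
    stat = 0.0
    for i, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_sums[i] * col_sums[c] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def top_features(tables, how_many):
    """Indices of the features with the highest statistics, best first."""
    scores = [contingency_chi2(t) for t in tables]
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:how_many]


# Hypothetical counts: rows are categories of a feature, columns are classes.
tables = [
    [[10, 10], [10, 10]],   # feature 0: independent of the class
    [[18, 2], [2, 18]],     # feature 1: strongly associated with the class
    [[12, 8], [8, 12]],     # feature 2: weakly associated with the class
]
ranking = top_features(tables, 2)
```

Feature 0 scores zero because its observed counts equal the expected ones, so it is the first to be dropped by the selection.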

Note that the calculation of the χ² test involves only arithmetic 
operations such as addition and multiplication. For a dataset of N 
instances (or rows), k classes and n features of m categories each, 
the complexity of computing the χ² statistics is O(n * N) time and 
O(n * m * k) space. Using this formulation for numerical features will 
be very inefficient in terms of computation time since all different 
values of the feature are considered different categories (m would 
be large). Sometimes binning is used to decrease the number of 
categories. For example, in the case of n-grams, it is common to work 
with 4 * 10^6 features. If each feature has 100 categories (or bins) 
and we have 10 classes, the needed memory is proportional to 
4 * 10^9. In fact, the current Spark ML implementation splits features 
into groups of 1000 features and treats them sequentially, making 
needed space proportional to 4* 10° only in this scenario. Therefore 
the contingency tables can be stored on a single machine for each 
iteration. Still for N = 10? data points the computation time is 
proportional to 4 * 10°. Our experience is that this scenario would 
be exhausting for a small Spark cluster with the current Spark ML 
implementation. It takes long hours. 

The implementation of scikit-learn [3] does not build a complete 
contingency table per feature. The features are numerical and their 
values represent the number of occurrences of the feature in each 
data instance, for example the number of appearances of a term in 
a document. The following formula is used: 


$$\chi^2(f) = \sum_{c=1}^{k} \frac{\left(O(f, c) - E(f, c)\right)^2}{E(f, c)}$$


where O(f, c) is the observed sum of the numerical values of f across 
all data instances belonging to class c, and E(f, c) is the expected 
sum assuming independence between the feature and the class label. 
For a dataset of N instances (or rows), k classes and n features, the 
complexity of computing the χ² statistics is O(n * N) time and 
O(n * k) space. In the case of the aforementioned n-grams example, 
the computation time is again proportional to n * N. However, 
exploiting sparsity in the data reduces it considerably. Our Spark 
implementation is able to perform it in a few minutes, as will be 
shown in the experiments section. 

To illustrate the difference between the two approaches, let’s 
consider the case of a binary feature with the following contingency 
matrix (rows are categories, columns are classes): 


          | c0    | c1 
       f  | A = 3 | B = 5 
       ¬f | C = 2 | D = 10 


A, B, C and D are the observed counts of respectively (f, c0), 
(f, c1), (¬f, c0) and (¬f, c1). The size of the dataset is N = A + 
B + C + D. We consider the numerical example where A = 3, B = 
5, C = 2, D = 10, N = 20. The two libraries will compute the χ² 
scores as shown in the following table: 


           | Spark ML [4]                           | Scikit-learn [3] 
   Formula | N(AD - BC)^2 / ((A+B)(C+D)(A+C)(B+D))  | (AD - BC)^2 / ((A+B)(A+C)(B+D)) 
   Score   | 10/9                                   | 2/3 
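
Both scores can be checked by hand with exact rational arithmetic. The sketch below uses only the Python standard library; the variable names are illustrative and it calls neither Spark nor scikit-learn.

```python
from fractions import Fraction

A, B, C, D = 3, 5, 2, 10
N = A + B + C + D

# Full 2 x 2 contingency table (the categorical formulation).
pearson_score = Fraction(N * (A * D - B * C) ** 2,
                         (A + B) * (C + D) * (A + C) * (B + D))

# Observed and expected sums of the feature values only (numerical formulation).
class_prob = [Fraction(A + C, N), Fraction(B + D, N)]
observed = [A, B]                           # sum of f over class c0 and class c1
expected = [(A + B) * p for p in class_prob]
sum_score = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

The second score is smaller because the rows where the feature is zero contribute nothing to the observed and expected sums, whereas the contingency table counts them as a category of their own.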


The Python API code to generate the scikit-learn score is shown 
next: 


from sklearn.feature_selection import chi2 
import numpy as np 

# Binary feature: 8 instances with f = 1 followed by 12 with f = 0 
X = np.array([1]*8 + [0]*12) 
X = X.reshape(-1, 1) 
# Labels: (f, c0) x 3, (f, c1) x 5, (¬f, c0) x 2, (¬f, c1) x 10 
Y = [0]*3 + [1]*5 + [0]*2 + [1]*10 
chi2_score, p_value = chi2(X, Y) 


The Scala API code to generate the Spark ML score is shown next: 


// Assumes a spark-shell or notebook session where toDF is available 
// through spark.implicits._ 
import org.apache.spark.ml.linalg.Vectors 
import org.apache.spark.ml.stat.ChiSquareTest 
val labels = Seq(0, 0, 0, 1, 1, 1, 1, 1, 
                 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) 
val feature = 
  (for (i <- 0 until 8) yield Vectors.dense(1)) ++ 
  (for (i <- 0 until 12) yield Vectors.dense(0)) 
val data = labels zip feature 
val df = data.toDF("label", "features") 
ChiSquareTest.test(df, "features", "label").show 


4 IMPLEMENTATION 


In this section we explore, in a more technical way, the different 
implementations of the computation of χ² statistics and we propose 
our own alternative. 


4.1 Scikit-learn: Implementation for a single 
machine 


Scikit-learn is a machine learning Python library. The scikit-learn 
implementation starts by building the class-instance contingency 
matrix. An element e_ij of this matrix is equal to 1 if instance j 
belongs to class i, 0 otherwise. It proceeds with the multiplication 
of two matrices: the class-instance contingency matrix of size k * N 
and the instance-feature contingency matrix N * n. The result is 
the class-feature contingency matrix of size k * n. The class-feature 
matrix contains the observed sums for all class-feature pairs. The 
matrix multiplication is very efficient in the case where the matrices 
fit in memory. Scikit's reliance on underlying numerical libraries, 
namely numpy and scipy, makes this matrix multiplication, and 
hence the whole procedure very fast. 
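
The matrix formulation described above can be sketched without any numerical library. The function below is a plain-Python illustration under stated assumptions, not scikit-learn's code: the name `chi2_scores` is hypothetical, nested lists stand in for numpy or scipy matrices, and every feature is assumed to have a non-zero total.

```python
def chi2_scores(X, y):
    """Chi-squared score per feature from observed and expected sums."""
    classes = sorted(set(y))
    n_rows, n_feats = len(X), len(X[0])
    # Class-instance indicator matrix (k x N): 1 if instance j is in class i.
    Y = [[1 if label == c else 0 for label in y] for c in classes]
    # Observed sums (k x n): product of Y (k x N) and X (N x n).
    observed = [[sum(Y[i][j] * X[j][f] for j in range(n_rows))
                 for f in range(n_feats)]
                for i in range(len(classes))]
    feature_sum = [sum(row[f] for row in X) for f in range(n_feats)]
    class_prob = [sum(Y[i]) / n_rows for i in range(len(classes))]
    scores = []
    for f in range(n_feats):
        score = 0.0
        for i in range(len(classes)):
            expected = class_prob[i] * feature_sum[f]
            score += (observed[i][f] - expected) ** 2 / expected
        scores.append(score)
    return scores


# The binary example of section 3: 8 rows with f = 1, then 12 rows with f = 0.
X = [[1]] * 8 + [[0]] * 12
y = [0] * 3 + [1] * 5 + [0] * 2 + [1] * 10
scores = chi2_scores(X, y)
```

The explicit triple loop costs O(k * N * n) here. A vectorised or sparse matrix product, as used by numerical libraries, computes the same observed sums far more efficiently.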


4.2 Spark ML/MLlib implementation: 
Categorical features 


Spark ML supports Pearson’s Chi-squared (χ²) tests for independence. 
The API ChiSquareTest takes as input a dataframe of categorical 
labels and categorical features. The ML implementation 
is just a wrapper that transforms the dataframe into an RDD of 
LabeledPoint and passes it to the old implementation (ChiSqTest) 
of the MLlib library. The ChiSqTest implementation is marked as 
experimental and belongs to the org.apache.spark.mllib.stat.test 
package. The RDD is passed to the chiSquaredFeatures function. 
The function starts by counting the number of features and build- 
ing an array of type ChiSqTestResult to store the results for each 
feature (referred to as col in the implementation). The maximum 
number of allowed categories per feature (and the maximum num- 
ber of allowed distinct labels) is fixed at 10,000. The function groups 
the features into batches of 1000 features each and processes them 
sequentially. The transformation mapPartitions is followed by 
the action countByValue in order to generate the (category, label) 
pairCounts. For each feature, the pairCounts are accumulated in 
a contingency matrix. The Breeze linear algebra library is locally 
used to compute the final χ² results. 
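
The batching and counting pattern described above can be mimicked in plain Python. The sketch below is an analogy only: `Counter` plays the role of countByValue, the batch size is shrunk from 1000 to 2, and the rows and variable names are hypothetical rather than taken from the Spark source.

```python
from collections import Counter

BATCH_SIZE = 2  # Hypothetical batch size, kept tiny for illustration.

# Hypothetical rows: (label, list of categorical feature values).
rows = [
    (0, ["a", "x", "p"]),
    (1, ["a", "y", "p"]),
    (1, ["b", "y", "q"]),
    (0, ["a", "x", "q"]),
]
num_features = len(rows[0][1])

contingency = {}  # feature index -> Counter over (category, label) pairs
for start in range(0, num_features, BATCH_SIZE):
    batch = range(start, min(start + BATCH_SIZE, num_features))
    # One pass over the data per batch, counting (feature, category, label).
    pair_counts = Counter(
        (col, values[col], label) for label, values in rows for col in batch
    )
    # Accumulate the counts of each feature into its own contingency table.
    for (col, category, label), count in pair_counts.items():
        contingency.setdefault(col, Counter())[(category, label)] = count
```

Each batch triggers a full pass over the rows, so the number of passes grows with the number of features divided by the batch size. This is the sequential cost that a single-pass formulation avoids.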


4.3 Our implementation over Spark: Numerical 
features 


We propose a Scala implementation of χ² feature selection for Spark. 
Our implementation differs from the Spark ML library in the 
following points: (1) It is based on the scikit-learn formulation for 
feature selection. This choice allows some optimizations which 
are not possible in the Spark ML implementation, namely taking 
the sparsity of features into consideration. (2) It is mainly focused on 
numerical features. (3) It uses the capabilities of the dataframe API 
whenever possible. In contrast, the Spark ML implementation 
is completely based on the RDD capabilities. (4) The 
implementation is completely distributed and does not resort to the 
Breeze library. Breeze would require that the category vectors and 
the contingency matrix be available locally and fit within a 
single node. We do not make such assumptions. 

Figure 1: Flow-chart of the sparse version in Scala/Spark 


Table 1: Summary of Datasets 


Dataset | Size    | Rows (N)   | Features (n) 
1       | 4.5 MB  | 32,561     | 288,196 
2       | 82.3 MB | 748,401    | 1,163,024 
3       | 1.3 GB  | 4,203,876  | 11,557,504 
4       | 2.67 GB | 8,407,752  | 20,216,830 
5       | 24.9 GB | 45,840,617 | 999,999 


Our implementation has two different versions: the 
dense version and the sparse version. The dense version takes each 
feature vector as a dense vector and does not assume sparsity. By 
sparsity we mean that the value of a feature could be 0 in a large 
number of rows (or samples). We measure it as the ratio of zero 
entries to non-zero entries in the instance-feature matrix. This is 
often the case in term counts and n-gram features. The sparse version 
takes advantage of this property to speed up the computation. Note 
that sparsity can also be exploited in the case of categorical features. 
0 would be considered as a distinct category, but the number of 
occurrences of 0 would be computed as N minus the sum of the 
occurrences of all the other categories. 
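
The subtraction trick for the zero category, together with the sparsity measure defined above, can be illustrated as follows. The column contents and the row count are hypothetical.

```python
from collections import Counter

N = 10  # Hypothetical number of rows.

# Hypothetical sparse column: only non-zero categories are stored, by row index.
nonzero_entries = {1: 2, 4: 1, 7: 2}

counts = Counter(nonzero_entries.values())
# The zero category is never scanned: its count follows by subtraction.
counts[0] = N - sum(counts.values())

# Sparsity as defined in the text: zero entries over non-zero entries.
sparsity_ratio = counts[0] / len(nonzero_entries)
```

Only the three stored entries are visited, yet the count of the zero category is exact.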

The implementation flow chart is shown in Figure 1. It starts by 
loading the data in sparse format (LibSVM format), then it builds 
three dataframes (or tables): (1) the first dataframe has the total sum 
for each feature, (2) the second dataframe has the observed pair 
counts for each (label, feature) tuple, and (3) the third dataframe 
has the frequency of appearance of each label in the dataset. RDD 
capabilities are used to generate these dataframes. The last step is 
to build the contingency table out of the three dataframes using two 
joins and a user defined function chi2UDF. Dataframe capabilities 
are used at this stage. 
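
The three tables and the final join can be sketched in plain Python on the binary example of section 3. This is an illustration under stated assumptions, not the authors' Scala code: dictionaries stand in for the dataframes, the nested loop stands in for the two joins, and `chi2_term` is a hypothetical stand-in for chi2UDF, whose body the paper does not show.

```python
from collections import defaultdict

# Hypothetical sparse rows: (label, {feature index: non-zero value}).
rows = [(0, {0: 1})] * 3 + [(1, {0: 1})] * 5 + [(0, {})] * 2 + [(1, {})] * 10

feature_total = defaultdict(float)   # table 1: total sum per feature
observed = defaultdict(float)        # table 2: observed sum per (label, feature)
label_count = defaultdict(int)       # table 3: number of rows per label

for label, features in rows:
    label_count[label] += 1
    for index, value in features.items():   # only non-zero entries are visited
        feature_total[index] += value
        observed[(label, index)] += value

n_rows = len(rows)


def chi2_term(obs, total, count):
    """Contribution of one (label, feature) pair to the feature's score."""
    expected = total * count / n_rows
    return (obs - expected) ** 2 / expected


# "Join" the three tables on feature and label, then sum the terms per feature.
scores = defaultdict(float)
for index, total in feature_total.items():
    for label, count in label_count.items():
        scores[index] += chi2_term(observed.get((label, index), 0.0), total, count)
```

The single pass over the rows touches only stored entries, so the work is proportional to the number of non-zero values rather than to n * N. A (label, feature) pair that never occurs is treated as an observed sum of 0, which is an assumption of this sketch.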


5 EXPERIMENTS 


We have experimented with datasets of different properties such 
as size, number of samples, number of features and sparsity [1]. 
The number of classes is k = 2 for all the datasets. The dataset 
properties are listed in Table 1. We store the datasets in an AWS S3 
bucket to be easily retrieved by the Spark Databricks cluster. All 
the datasets are in LibSVM sparse format. The LibSVM format is 
efficient for storing sparse datasets since only non-zero values are 
stored along with their indices. Each row represents a data instance 
(or sample) and has the following form: 
<label> <index1>:<value1> <index2>:<value2> ... 

For example, the line '+1 2:1 4:2' represents a data instance with class 
+1 and feature vector [0, 1, 0, 2]. 
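
A minimal parser for one such line could look like the sketch below. The function name is hypothetical, values are read as integer counts (an assumption that suits term-count data), and the number of features is passed in explicitly.

```python
def parse_libsvm_line(line, num_features):
    """Parse '<label> <index>:<value> ...' into (label, dense feature vector)."""
    parts = line.split()
    label = int(parts[0])                     # int() accepts a leading '+' sign
    vector = [0] * num_features
    for item in parts[1:]:
        index, value = item.split(":")
        vector[int(index) - 1] = int(value)   # indices in the example are 1-based
    return label, vector


label, vector = parse_libsvm_line("+1 2:1 4:2", 4)
```

A sparse implementation would keep the (index, value) pairs as they are instead of expanding them into a dense vector.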

We have used the Databricks platform for our experiments. Databricks 
provides a tiny cluster for preliminary experiments known as the community 
edition. The paid edition provides a pre-configured Spark cluster based on 
AWS machines. The specifications of the two options that we have used are 
shown in Table 2. 


Table 2: Summary of Cluster Specifications 


         | Community Edition (CE)  | Paid Edition (PE) 
Workers  | 0 workers               | 2-8 workers 
         |                         | 128.0-512.0 GB Memory 
         |                         | 16-64 Cores 
         |                         | 3.6-14.4 DBU 
Driver   | 1 Driver: 6.0 GB Memory | 1 Driver: 64.0 GB Memory 
         | 0.88 Cores              | 8 Cores 
         | 1 DBU                   | 1.8 DBU 
Cost     | Free                    | $0.40 per DBU 
Software | Scala 2.11, Spark 2.3.1, Databricks Runtime 4.3 


Figure 2: Runtime for dataset #3 with Dense Version DV, 
Sparse Version - Community Edition SV(CE), and Sparse Version - 
Paid Edition SV(PE) (bar chart; y-axis in seconds) 


Figure 3: Runtime for the 5 datasets over the PE cluster 

We compare the performance of our two versions (DV and SV) over the 
two clusters (CE and PE) based on a common dataset. The results for dataset 
#3 are shown in Figure 2. It is clear that taking advantage of sparsity radically 
improves performance. The processing takes only a few minutes over the 
real cluster (PE). The runtime for the different datasets over the PE cluster 
is shown in Figure 3. 

We have also experimented with different partitioning schemes. For 
Dataset #5 we have set the number of partitions to 80, 160, 200, or the 
default settings. The default settings are defined by Spark and depend on 
the dataset size and the number of available cores [2]. We plot the runtime for 
different size percentages (20% to 100%) of Dataset #5 (~25 GB) in Figure 
4. Results show that increasing the number of partitions helps decrease 
the overall runtime. The runtime increases gently with respect to the data 
size. This is a good indication that the distributed implementation over 
Spark is working well. We have also experimented with re-partitioning 
the data based on the class as a key. The re-partitioning has shown similar 
performance in the case of our datasets. We plan to investigate partitioning 
in more depth in future work. 


Figure 4: Effect of Partitioning - dataset #5 


6 RELATED WORK 


There is an increasing interest in implementing distributed linear algebra 
and machine learning routines and data structures over Hadoop [13] and 
Apache Spark [7]. For lessons learned while implementing a basic 
machine learning algorithm, we refer the reader to the talk in [8]. Another 
area of interest focuses on automatic building and optimization of complex 
Spark applications. In [12], a component-based framework for composing 
independently developed Spark applications is proposed. This framework is 
equipped with a transformation-based optimizer that takes a Spark program 
and generates a state-space of semantically equivalent programs by applying 
a set of rewrite rules. The best semantic-equivalent program is returned 
based on a set of pre-selected strategies. 

Selecting the maximum quality levels to execute given Spark applications 
with quality of service constraints is investigated in [10]. ASC [16] is an 
automatic checkpoint algorithm that optimizes the selection, frequency and 
timing of RDD persisting in a long lineage. A checkpoint cuts off the lineage 
and saves the data which is required in the upcoming computations. The 
solution is shown to have a small overhead with respect to the performance 
benefits it brings in case of failures. Spark SQL is based on a highly extensible 
optimizer called Catalyst [6]. It is built using features of the Scala 
programming language that make it easy to add composable rules and control 
code generation. Catalyst is used to build a variety of features tailored for 
the complex needs of modern data analysis. VEGA [9] is an Apache Spark 
framework for optimizing a series of similar Spark programs. These 
programs likely originate from an exploratory data analysis session. Data 
scientists can leverage Vega to significantly reduce the amount of time spent 
modifying and re-executing Spark programs over large datasets. HYLAS 
[11] is a tool for automatically optimising Spark queries embedded in source 
code via the application of semantics-preserving transformations. Hylas 
can identify certain computationally expensive operations and transform 
them to better alternatives, which leads to significant improvements in 
execution time. The contribution of this paper is different since it considers 
a very specific task and contrasts it to the standard library implementation 
in Spark ML. 


7 CONCLUSION AND FUTURE WORK 


In machine learning, feature engineering is very important to reduce the 
training time, improve accuracy, and avoid overfitting the training data. 
While most practitioners run their machine learning algorithms on specialized 
GPU farms rather than on cluster computing platforms such as Hadoop 
and Spark, cluster computing is nevertheless essential for data cleaning, 
pre-processing and feature selection. One of the popular statistical models 
to select the most relevant features is the Chi-Squared test. In this paper we 
compared the implementations of Chi-Squared feature selection in scikit- 
learn and Spark ML. We proposed a new implementation for numerical 
features over Apache Spark. We showed the performance of our approach 
using different real-world data sets. We also studied the effects of spar- 
sity and partitioning. The benchmarking is performed over the Databricks 
cloud computing platform. It is worth noting that our approach runs much 
faster than the current Spark ML one for three main reasons: (1) we use 
a formulation with lower algorithmic complexity, (2) we do not have any 
sort of sequential processing of features in chunks like in the Spark ML 
implementation, and (3) we use the optimized dataframe API and Spark SQL 
routines whenever possible. In future work, we aim to study and implement 
advanced linear algebra, feature selection and machine learning algorithms 
over Apache Spark. 


ABSTRACT 


Encrypted data transmitted over secure wireless channels as part 
of Internet of Things (IoT) infrastructures are mainly stored in 
database tables as plaintext records. As confidence in IoT data 
content for supporting A.I. actions radically increases, the 
selection and use of database encryption is essential in terms of 
data integrity and safety, as well as the validity and 
non-repudiation of IoT operations or decisions. Taking into account 
the amount of stored IoT data and the data filtering and processing 
overhead needed for apt decision making, it is essential for a 
developer to seriously take into account the overheads of 
encryption-decryption queries in big IoT data processing tasks, as 
well as the extra time delays of encryption-decryption operations, 
since these translate into energy expenditure in idle or receiving 
states at the wireless IoT motes. 


KEYWORDS 


Database encryption, Big data, IoT, performance evaluation 


1 Introduction 


IoT is an ongoing technological trend. It is estimated that 
there will be more than 9 billion devices by 2020, most of 
which will operate autonomously based on the decisions of AI 
algorithms fed by IoT data. Up to now, the decision logic of IoT 
actuators and controllers has been performed by the application services 
or cloud services layers, but this is about to change. Since 
contemporary IoT devices are moving away from 8- and 16-bit 
microcontrollers to low-power 32- and 64-bit microprocessor devices 
with large flash storage and high-throughput transponders, only the 
data sets will be left at the service end, while the algorithmic 
logic gradually migrates to the mote end. 


2 Performance evaluation 


The performance evaluation scenarios include three types of data 
encryption/decryption stress tests performed on a PostgreSQL 
database server. The database system is a dual-core 3.2 GHz 
system with 4 GB RAM and 500 GB storage, of similar 
characteristics to the Microsoft Azure A2 instance provided as a 
business solution for small to medium databases [3]. Since 
commonly used database benchmarks [4, 5, 6] do not include 
tests on encrypted columns, the authors created a set of Python 
scripts to test PostgreSQL encryption with AES-128, the cipher 
currently used by IoT wireless nodes for data transmissions. The three 
cases are examined on a table of AES-encrypted columns following a 
per-field relational schema and on a table with a single AES-encrypted 
column that follows the schemaless JSONB form and includes all IoT 
measurements using the JSON notation [1, 2]. 


Table 1: Average insertion, selection and aggregation times for 
relational and JSONB fields on encrypted and non-encrypted data 


IDEAS2019 - 23nd International Database Engineering & Applications Symposium 329 


IDEAS’ 19, June, 2019, Athens, Greece 


Columns: Relational plaintext | JSONB plaintext | Relational encrypted | JSONB encrypted 

Average Insertion Time (ms) 
  1M records: 4.12  9.5  9.69 
  5M records: 3.81  8.25  8.64  10.55 
Average Selection Time (ms) 
  1M records: 3424.72  8818.40  9333.66  7095.92 
  5M records: 3595.73  34090.18  59356.59 
Average Aggregation Time (ms) 
  1M records: 245.18  646.19  10005.12  9722.44 
  5M records: 2652.28  937.47  49846.93  45896.58 


The comparison of insertions into relational and JSONB tables with 
encrypted and non-encrypted fields has shown that for 1M records, 
relational plaintext is the fastest at 4.12988 ms, while JSONB 
encrypted is the slowest, with an average processing time of 9.60483 ms. 
For 5M records, relational plaintext remains the fastest, with an 
average execution time of 3.81368 ms. In contrast, JSONB encrypted 
gave the worst result at 10.55648 ms. The selection results have 
shown that for 1M records relational plaintext is the fastest at 
3424.7229 ms and relational encrypted is the slowest at 
9333.6668 ms. For 5M records, relational plaintext gives the best 
performance at 3595.73602 ms and relational encrypted the worst at 
59356.59885 ms. The aggregation results for 1M records have 
shown that relational plaintext performed the best at 245.18394 ms, 
in contrast with relational encrypted, which increased to 10005.12385 
ms. For 5M records, JSONB plaintext performed the best at 
937.47463 ms and the worst case amongst the four is relational 
encrypted, which increased to 49846.93503 ms. 


3 Conclusions 


Ch. Asiminidis et al. 


Comparisons between relational tables of plaintext IoT data and 
relational tables of AES-128 encrypted records have shown that 
the mean processing time for IoT data inserts increases 2-2.5 
times. For single column decrypt-select queries the mean 
processing time increases by 2%-92% over the corresponding plaintext 
query processing time for 1M and 5M records respectively. The 
authors also pinpoint an additive factor of increase in the select 
query processing time, proportional to the number of 
decrypted fields per query (additive factor of 1.5x, where 
x is the number of decrypted fields). A single column aggregation 
function performed on an encrypted table field costs 25-18.7 times 
more processing time than on its corresponding plaintext field, for 
1M and 5M records respectively. Comparisons between JSONB 
tables of plaintext fields and JSONB tables of AES-128 encrypted 
fields have shown that the mean processing time for IoT data 
inserts remains the same. For single column decrypt-select queries 
the mean processing time increases by 8% over the corresponding 
plaintext query processing time for both 1M and 5M records. 
Finally, cross-comparisons between JSONB tables 
and relational tables of encrypted AES-128 IoT data have shown 
that for data inserts the mean processing time for JSONB tables is 
25% more than for the relational tables for both 1M and 5M records. 
For IoT data selects, JSONB processing time is 40-60% 
more than the relational table select queries on a single 
field for 1M and 5M records respectively. 


ABSTRACT 


This paper proposes a machine-learning framework for supporting 
intelligent web phishing detection and analysis, and provides its 
experimental evaluation. In particular we make use of state-of-the- 
art decision tree algorithms for detecting whether a Web site is 
able to perform phishing activities. If this is the case, the Web site 
is classified as a Web-phishing site. Our experimental evaluation 
confirms the benefits of applying machine learning methods to the 
well-known web-phishing detection problem. 


1 INTRODUCTION 


Web security is an emerging trend, especially in the novel big data 
context (e.g., [1, 6, 15]). Traditionally, Web security has been ad- 
dressed by exploiting several approaches, such as privacy-preserving 
methodologies (e.g., [1]), hidden Markov models (e.g., [21]), logic- 
based approaches (e.g., [9]), and so forth. 

This traditional challenge, which involves both academic and 
industrial research, is now re-emerging because of its tight 
relation with novel big data trends (e.g., [7, 10, 19]), which have 
given rise to some very interesting approaches, among which [8, 18] 
are notable. 

Among several problems, Web phishing (e.g., [5, 16, 17, 20]) is 
currently of particular interest. Phishing is a method of imitating the 
official or genuine websites of an organization, such as 
banks, institutes, social networking websites, etc. Phishing mainly 
attempts to steal users' private credentials, such as usernames, 
passwords, PINs or credit card details. It 
is carried out by trained hackers or attackers, mostly 
through phishing e-mails. Such e-mails may contain 
a fraudulent or duplicate website link generated by the 
attacker. Clicking such a link redirects the user to a malicious 
website, where personal credentials can easily be stolen. 
Phishing detection is a technique for detecting phishing activity. 
Many methods have been proposed by researchers. 
Among them, data mining techniques are some of the most promising 
for detecting phishing activity, and data mining is therefore a 
current research trend in detecting and preventing phishing 
websites. 

Starting from these considerations, and in order to improve on 
the performance reported in the current literature, in this 
paper we propose a machine-learning-based method able to identify 
whether a web page performs phishing activities. 

Figure 1 shows the big picture of our framework. As shown in 
Figure 1, in our reference application scenario, several Web Users 
interact with Web Phishing Sites (still unknown, of course), 
and the goal of our framework is to detect the Web phishing 
sites and notify the users. To this end, the Feature 
Extraction component is in charge of extracting suitable features to drive the 
machine-learning-based detection phase. The extracted features 
populate an ad-hoc Built-In Dataset. Finally, the 
Decision Tree Algorithms run over this dataset and the Web 
phishing detection event is reported to the Web Users. 

This paper extends the previous short paper [4], where we in- 
troduced the main ideas of the proposed framework. 


2 DECISION-TREE ALGORITHMS FOR WEB 
PHISHING DETECTION 


In this section we describe the method we propose for detecting web 
phishing attacks. 

Table 1 shows the features considered in the following study. 

In order to collect data, we consider the PhishTank dataset¹: 
PhishTank is a free community site where anyone can submit, verify, 
track and share phishing data. The dataset is provided in .csv 
format. 

The evaluation consists of two stages: (i) hypothesis 
testing, to verify whether the feature vectors exhibit 
different distributions for the attack and normal message populations; 
and (ii) decision-tree machine learning analysis, in order to assess 

¹https://archive.ics.uci.edu/ml/datasets/Website+Phishing 


IDEAS2019 - 23nd International Database Engineering & Applications Symposium 331 


Intelligent Web-Phishing Detection and Analysis 


IDEAS'19, June 10-12, 2019, Athens, Greece 


[Figure 1 diagram. Components: Web Users; Web Phishing Sites; Built-In Dataset; Decision Tree Algorithms; Web Phishing Detection; <Web Phishing Detection Event>] 


Figure 1: The Proposed Machine-Learning-Based Web Phishing Detection Framework 


Variable Feature 
F1 URL Anchor 
F2 Request URL 
F3 Server Form Handler 
F4 URL Length 
F5 Having IP Address 
F6 Prefix/Suffix 
F7 IP 
F8 Sub Domain 
F9 Website Traffic 
F10 Domain Age 


Table 1: The Feature Set Involved in the Study 


if the eight features are able to discriminate between attacks and 
normal messages. 

Machine learning is a type of artificial intelligence able to pro- 
vide computers with the ability to learn without being explicitly 
programmed [14]. 

Machine learning tasks are typically classified into two cate- 
gories, depending on the nature of the learning available to a learn- 
ing system: 


- Supervised learning: the computer is presented with example 
inputs and their desired outputs, given by a "teacher", and the 
goal is to learn a general rule that maps inputs to outputs. It 
corresponds to classification: the process of building a model 
of classes from a set of records that contains class labels. 

- Unsupervised learning: no labels are given to the learning 
algorithm, leaving it on its own to find structure in its input. 
Unsupervised learning can be a goal in itself (discovering 
hidden patterns in data) or a means towards an end (feature 
learning). 


The algorithms considered are supervised and decision-tree-based, 
i.e., they use a decision tree as a predictive model which maps 
observations about an item (represented in the branches) to conclusions 
about the item's target value (represented in the 
leaves). These algorithms (i.e., J48, HoeffdingTree, RandomForest, 
RepTree, LMT and DecisionStump) are among the most widely used to solve 
data mining problems [14], for instance, from malware detection 
[2, 3, 11, 13] to pathology classification [12]. 
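
To make the branch/leaf mapping described above concrete, the following is a minimal sketch of a one-level decision tree (a decision stump) in plain Python. It is not the Weka implementation used in this study: the Gini split criterion, the function names and the toy data (two columns standing in for URL Length and Domain Age, encoded as -1, 0 or 1) are all assumptions made for illustration. 

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels):
    """Pick the (feature, threshold) split with the lowest weighted
    Gini impurity and label each branch with its majority class."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must send items to both branches
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t, majority(left), majority(right))
    if best is None:
        raise ValueError("no valid split found")
    _, f, t, left_label, right_label = best
    return {"feature": f, "threshold": t, "left": left_label, "right": right_label}


def predict(stump, row):
    """Follow the single branch test and return the leaf label."""
    if row[stump["feature"]] <= stump["threshold"]:
        return stump["left"]
    return stump["right"]


# Hypothetical toy data, not taken from the dataset used in the paper.
X = [(-1, -1), (-1, 0), (-1, 1), (1, 1), (1, 0), (0, 1)]
y = ["phishing", "phishing", "phishing", "legitimate", "legitimate", "legitimate"]
stump = fit_stump(X, y)
```

On this toy data the stump splits on the first column; deeper trees such as J48 or RepTree repeat the same kind of split recursively on each branch. 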

We consider in this work six different machine learning algorithms 
in order to strengthen the conclusion validity. With regard to 
the hypothesis testing, the null hypothesis to be tested is: 

H0: 'phishing and legitimate web pages exhibit similar values 
of the considered features'. 

The null hypothesis was tested with the Wald-Wolfowitz, 
Mann-Whitney and Kolmogorov-Smirnov tests (each with the p-level 
fixed to 0.05). We chose to run three different tests in order to 
strengthen the conclusion validity. These tests are evaluated against a 
level of significance, i.e., the risk (the probability) that erroneous 
conclusions are drawn: in our case, we set the significance level 
to 0.05, which means that we accept making mistakes 5 times 
out of 100. The goal of the classification analysis is to verify whether 
the considered features are able to correctly discriminate between 
phishing and normal web pages. The classification algorithms were 
applied to the full feature vector. 
The classification analysis is performed using the Weka² tool, a 
suite of machine learning software employed in data mining for 
scientific research. 
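
As an illustration of the hypothesis-testing stage, the sketch below computes the two-sample Kolmogorov-Smirnov statistic for one feature and compares it with the usual large-sample critical value at the 0.05 level. It is a simplified stand-in, not the statistical package used in this study: the function names and the toy feature values are hypothetical, and the critical-value approximation is conservative for discrete features such as those encoded as -1, 0 or 1. 

```python
import math


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare CDFs.
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def critical_value(n, m, alpha=0.05):
    """Large-sample critical value c(alpha) * sqrt((n + m) / (n * m))."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


def rejects_null(a, b, alpha=0.05):
    """True when the two samples differ significantly at level alpha."""
    return ks_statistic(a, b) > critical_value(len(a), len(b), alpha)


# Hypothetical values of a single feature for the two populations.
phishing_values = [-1] * 30 + [0] * 10
legitimate_values = [1] * 30 + [0] * 10
d = ks_statistic(phishing_values, legitimate_values)
```

Rejecting the null hypothesis for a feature indicates that its distribution differs between phishing and legitimate pages, which is what makes it a candidate for the classification stage. 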


3 EXPERIMENTAL ASSESSMENT AND 
ANALYSIS 


We used five metrics to evaluate the classification 
results: Precision, Recall, F-Measure, MCC and ROC Area. The results 
obtained with this procedure are shown in Table 2. 


²http://www.cs.waikato.ac.nz/ml/weka/ 
Algorithm       Precision  Recall  F-Measure  MCC    ROC Area  Class 
J48             0,904      0,892   0,898      0,829  0,958     legitimate 
                0,923      0,916   0,919      0,833  0,958     phishing 
HoeffdingTree   0,840      0,892   0,865      0,770  0,948     legitimate 
                0,882      0,916   0,899      0,786  0,953     phishing 
RandomForest    0,891      0,892   0,892      0,818  0,968     legitimate 
                0,917      0,912   0,914      0,822  0,966     phishing 
RepTree         0,856      0,911   0,882      0,799  0,964     legitimate 
                0,933      0,872   0,901      0,804  0,961     phishing 
LMT             0,876      0,892   0,884      0,804  0,970     legitimate 
                0,922      0,905   0,913      0,821  0,972     phishing 
DecisionStump   0,794      0,849   0,820      0,692  0,835     legitimate 
                0,836      0,913   0,873      0,726  0,845     phishing 


Table 2: Classification results. 


As shown in Table 2, the proposed method obtains a 
precision of 0,923 and a recall of 0,916 in phishing attack 
detection using the J48 algorithm. The classification algorithms 
obtaining the best precision are J48 and RepTree; however, considering 
also the recall metric, the RepTree recall is lower than 
the one obtained by the J48 classification algorithm. 
For this reason we regard J48 as the algorithm 
obtaining the best performance in terms of precision and recall 
for detecting web phishing attacks. The 
remaining algorithms (i.e., HoeffdingTree, RandomForest, LMT and 
DecisionStump) exhibit lower performance than J48 and RepTree 
in terms of precision and recall. 
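
For reference, four of the five metrics can be derived directly from the counts of a binary confusion matrix, as in the sketch below. The counts are hypothetical and are not the ones behind Table 2; ROC Area is omitted because it requires ranked prediction scores rather than counts. 

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F-measure and MCC for the positive class,
    given true/false positive and negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Hypothetical counts with "phishing" as the positive class.
metrics = binary_metrics(tp=90, fp=10, fn=10, tn=90)
```

Swapping the roles of the two classes (treating "legitimate" as positive) yields the second row reported for each algorithm. 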


4 CONCLUSIONS AND FUTURE WORK 


Recently, a more effective approach to fighting phishing, relying on 
machine learning techniques, has emerged. In this approach, models 
extracted by an ML technique are used to classify websites 
as either legitimate or phishing, based on certain features. In this paper we 
propose a machine-learning-based method able to detect whether 
a web page carries out phishing attacks. As future work, we plan to 
extend the proposed features in order to increase the detection 
accuracy; in addition, we plan to apply formal methods with the 
aim of detecting the code in which the malicious action happens. 
ABSTRACT 


Online voting is a challenging socio-technical problem that is still 
an open research question. Current approaches are based on involved 
cryptographic solutions that can hardly be explained to the 
average voter. Still, attackers may be able to circumvent the system 
without breaking the cryptographic protocols. In this paper, we 
demonstrate a toy voting protocol based on additive homomorphic 
encryption and two non-colluding parties. The protocol design 
aims at preserving the essential security properties such as election 
integrity and voter anonymity while being much simpler to explain. 
We propose a RESTful implementation of a web front-end of the 
proposed system. We highlight the fact that cryptography is only 
one facet of security. By conducting several web attack scenarios to 
test the robustness of the web interface, we show that the system 
may still be compromised without breaking the cryptography. 
1 INTRODUCTION 


Voting is a method for a group of people to collectively elect a per- 
son, take a decision or express an opinion. Online voting systems 
are gaining acceptance with the widespread use of secure web ser- 
vices and cloud computing such as electronic currency and online 
banking. However, many researchers agree that online voting is 
hard. It has challenging privacy, security and accountability issues. 
We propose a cryptographic solution that is simple to explain to the 
voters. Our protocol is based on partially homomorphic encryption 
and two non-colluding parties. We provide a RESTful implementation 
of the voting framework using micro and distributed web 
services. More importantly, we recognize that the correctness of 
the cryptography does not ensure security, and that security is a 
process rather than just a product. Therefore, we pen-test our web 
services for the best-known web and remote vulnerabilities. 


2 RELATED WORK 


Many secure and verifiable voting schemes have been proposed with solid 
cryptographic foundations. They ensure the verifiability of the 
results (also known as the integrity property) and at the same 
time prevent any match between the votes and the corresponding 
voters (also known as the ballot secrecy property). State-of-the-art 
protocols use mix nets and homomorphic encryption. Many 
online systems offer verifiable elections, such as Helios 
(https://vote.heliosvoting.org/). With the rising popularity of Blockchain 
applications, voting has also been formalized in terms of a smart 
contract [4]. However, online voting has many practical constraints. 
Helios 2.0 has been shown to be vulnerable to a man-in-the-middle attack 
carried out by installing a browser rootkit that detects the ballot web page 
and modifies votes [3]. Blockchain has its own security weaknesses: 
several attacks against smart contracts are surveyed in [1]. In [8], 
the experience of attacking the Washington, D.C. Internet voting 
system highlights its many weaknesses: within 48 hours of the 
system going live, it was almost completely compromised. Many 
countries dropped electronic voting for absentee overseas voters 
over cybersecurity fears. Researchers consider voting hard and 
suggest physical redundancy, such as tally papers, to accompany 
any online system [2]. The reason is that, unlike online banking, 
online voting is not able to deal with compromises after they have 
occurred. In e-banking, transactions, statements, and logs 
allow customers to detect fraudulent transactions, and banking 
fraud is considered a marginal cost of doing business. Internet 
voting systems differ, since they must forgo fine-grained logs 
that might compromise the identity of the voters. 


3 VOTING FRAMEWORK AND PROTOCOL 


The voting protocol is based on previous work by a subset of 
the authors [6]. A front-end broker is introduced here as a trusted 
entity to manage the communication between the voter and two 
back-end servers: the trustee server, which distributes sealed voting 
envelopes, and the tally server, which counts votes under encryption. 
At the end of the election, the trustee verifies the integrity 
using a cryptographic identity, as shown in Fig. 1. 

Figure 1: Voting protocol communication diagram 

We use Paillier's cryptosystem [7] for its ability to add votes under 
encryption. This cryptosystem is asymmetric (public/private keys), 
probabilistic and IND-CPA (indistinguishable under chosen-plaintext 
attack). A high-performance implementation of this cryptosystem is 
proposed in [5]. The voting protocol allows binary votes (0 or 1). The 
voter solicits two envelopes from the trustee server. The trustee decomposes 
each choice into two shares. One of the shares is common across 
the two choices: 


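$$0 = \mathit{share}_0 + \mathit{share}_{\mathit{common}}$$ 

$$1 = \mathit{share}_1 + \mathit{share}_{\mathit{common}}$$ 


The trustee encrypts the shares, gives the vote an id and generates 
two envelopes: 


$$\mathit{env}_0 = (E_{pk}(0) \,\|\, E_{pk}(\mathit{share}_0) \,\|\, \mathit{id})$$ 
$$\mathit{env}_1 = (E_{pk}(1) \,\|\, E_{pk}(\mathit{share}_1) \,\|\, \mathit{id})$$ 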


The voter chooses only one of the two envelopes, under an oblivious 
transfer protocol, and casts the chosen envelope to the tally. The 
tally keeps two separate counters: one for the votes and one for the 
shares. The two counters are transferred to the trustee at the end 
of the election for verification. More details, possible malleability 
attacks and countermeasures are discussed in [6]. 
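
The additive property that the tally relies on can be sketched with a toy Paillier implementation in plain Python (3.8 or later, for the modular inverse via `pow`). This is only an illustration under stated assumptions, not the sElect implementation: the primes are far too small to be secure, the oblivious transfer step is omitted, a fresh common share is drawn per ballot, and the final check (decrypted vote total equals decrypted share total plus the sum of the common shares) is our assumed reading of the verification identity, whose exact form is given in [6]. All function names are hypothetical. 

```python
import math
import random


def keygen(p, q):
    """Toy Paillier key generation from two small primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse, valid for generator g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Probabilistic encryption of m with generator g = n + 1."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add(n, c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds plaintexts."""
    return (c1 * c2) % (n * n)


def make_envelopes(n, ballot_id):
    """Trustee side: split each choice as choice = share + common (mod n)."""
    common = random.randrange(n)
    envelopes = {}
    for choice in (0, 1):
        share = (choice - common) % n
        envelopes[choice] = (encrypt(n, choice), encrypt(n, share), ballot_id)
    return envelopes, common


n, priv = keygen(1009, 1013)
choices = [1, 0, 1, 1, 0]          # hypothetical ballots
vote_counter = encrypt(n, 0)       # tally side: encrypted running sums
share_counter = encrypt(n, 0)
common_sum = 0                     # trustee side: kept in the clear
for ballot_id, choice in enumerate(choices):
    envelopes, common = make_envelopes(n, ballot_id)
    enc_vote, enc_share, _ = envelopes[choice]
    vote_counter = add(n, vote_counter, enc_vote)
    share_counter = add(n, share_counter, enc_share)
    common_sum = (common_sum + common) % n

tally = decrypt(n, priv, vote_counter)
shares_total = decrypt(n, priv, share_counter)
identity_holds = tally == (shares_total + common_sum) % n
```

The tally server only ever multiplies ciphertexts, so it learns neither individual votes nor shares; only the trustee, holding the private key, can open the two counters at the end of the election. 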


4 SECURITY ASSESSMENT 


It is clear from our implementation experience that it is not suffi- 
cient to have the cryptography correct in order to create a secure 
online voting system, not even close. The system can be vulnerable 
to many web threats such as: 


- SQL injection. Tests were accomplished using SQLMap 
(http://sqlmap.org/). 

- Cross-site scripting. Tests were accomplished using Xsser 
(https://tools.kali.org/web-applications/xsser). 

- Brute force hash and password cracking attacks for weak 
admin passwords. 

- Flooding and Denial of Service. We simulated stress conditions 
using Artillery (https://artillery.io/), a modern load 
testing toolkit. 
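
To illustrate the first threat in the list above, the sketch below contrasts a query built by string concatenation with a parameterized one, using Python's standard sqlite3 module. The table, column names and credentials are hypothetical and do not describe the storage layer of the voting framework. 

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admins (username TEXT, password_hash TEXT)")
conn.execute("INSERT INTO admins VALUES ('admin', 'hash-of-secret')")


def login_unsafe(username, password_hash):
    # Vulnerable: user input is concatenated into the SQL text.
    query = (
        "SELECT COUNT(*) FROM admins WHERE username = '" + username
        + "' AND password_hash = '" + password_hash + "'"
    )
    return conn.execute(query).fetchone()[0] > 0


def login_safe(username, password_hash):
    # Parameterized: input is bound as data and never parsed as SQL.
    query = "SELECT COUNT(*) FROM admins WHERE username = ? AND password_hash = ?"
    return conn.execute(query, (username, password_hash)).fetchone()[0] > 0


payload = "' OR '1'='1"  # classic injection string
bypassed = login_unsafe("admin", payload)
blocked = not login_safe("admin", payload)
```

With the concatenated query the payload turns the WHERE clause into a condition that is always true, so the login check passes without a valid password; the parameterized version treats the same payload as an ordinary, non-matching string. 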


The framework is further available for testing in the form of a virtual 
machine that can be downloaded at https://drive.google.com/file/d/ 
1Et7-6Wt1dUFbOjnkAtM KP83qntU U25/view?usp-sharing. The 
VM has a video with instructions to configure and run the voting 
web service. 


5 CONCLUSION AND FUTURE WORK 


Web developers may mistakenly assume that implementing an 
election application is straightforward. However, security is a 
substantial issue that is critical to maintaining fair and anonymous voting. 
Security in voting systems is challenging given the multitude of 
attack vectors and vulnerabilities. In this paper, we have proposed 
sElect, an implementation of a binary voting system using homomorphic 
encryption. We stressed the fact that getting the cryptography 
right is not enough, and made the framework publicly 
available for further "black hat" testing. In future work, we will look 
at blockchain-based voting frameworks and assess their security. 


REFERENCES 


[1] N. Atzei, M. Bartoletti, and T. Cimoli. A survey of attacks on Ethereum smart 
contracts (SoK). In Principles of Security and Trust, pages 164-186. Springer, 2017. 

[2] B. Schneier. Voting Security. IEEE Security & Privacy. 
https://www.schneier.com/essays/archives/2004/07/voting security.html, July/August 2004. [Online; 
accessed 2015-07-24]. 

[3] S. Estehghari and Y. Desmedt. Exploiting the client vulnerabilities in internet 
e-voting systems: Hacking Helios 2.0 as an example. EVT/WOTE, 10:1-9, 2010. 

[4] P. McCorry, S. F. Shahandashti, and F. Hao. A smart contract for boardroom 
voting with maximum voter privacy. In International Conference on Financial 
Cryptography and Data Security, pages 357-375. Springer, 2017. 

[5] M. Nassar, A. Erradi, and Q. M. Malluhi. Paillier's encryption: Implementation 
and cloud applications. In 2015 International Conference on Applied Research in 
Computer Science and Engineering (ICAR), pages 1-5. IEEE, 2015. 

[6] M. Nassar, Q. Malluhi, and T. Khan. A scheme for three-way secure and verifiable 
e-voting. In 15th ACS/IEEE International Conference on Computer Systems and 
Applications (AICSSA'18), 2018. 

[7] P. Paillier. Public-key cryptosystems based on composite degree residuosity 
classes. In International Conference on the Theory and Applications of Cryptographic 
Techniques, pages 223-238. Springer, 1999. 

[8] S. Wolchok, E. Wustrow, D. Isabel, and J. A. Halderman. Attacking the Washington, 
D.C. Internet voting system. In Financial Cryptography and Data Security, pages 
114-128. Springer, 2012. 




Blockchain-based Micropayment Systems: Economic Impact 


Nida Khan Tabrez Ahmad Radu State 
University of Luxembourg ArcelorMittal Europe University of Luxembourg 
Luxembourg Luxembourg Luxembourg 
nida.khan@uni.lu tabrez.ahmad@arcelormittal.com radu.state@uni.lu 


ABSTRACT 


The inception of blockchain catapulted the development of innovative 
use cases utilizing the trustless, decentralized environment 
empowered by cryptocurrencies. The envisaged benefits of the technology 
include the divisible nature of a cryptocurrency, which can 
facilitate payments in fractions of a cent, enabling micropayments 
through the blockchain. Micropayments are a critical tool to enable 
financial inclusion and to aid in global poverty alleviation. The paper 
conducts a study on the economic impact of blockchain-based 
micropayment systems, emphasizing their significance for socioeconomic 
benefit and financial inclusion. The paper also highlights the 
contribution of blockchain-based micropayments to the cybercrime 
economy, indicating the critical need for economic regulations to 
curtail the growing threat posed by the digital payment mechanism. 
1 INTRODUCTION 


Micropayments fall into the category of electronic payment systems, 
which are financial transactions that take place through an 
electronic medium without using paper checks or cash. A micropayment 
is a financial transaction involving an amount of money 
less than a dollar, or even a fraction of a cent; however, no definite 
threshold has been agreed upon as a standard below which payment 
values count as micropayments, as seen in the nomenclature used 
in related literature [12, 14, 19]. High transaction fees are seen as 
a limiting factor for conducting micropayments with conventional 
payment solutions, and the practical implementation needed to bring about 

Permission to make digital or hard copies of part or all of this work for personal or 
classroom use is granted without fee provided that copies are not made or distributed 
for profit or commercial advantage and that copies bear this notice and the full citation 
on the first page. Copyrights for third-party components of this work must be honored. 
For all other uses, contact the owner /author(s). 

IDEAS’19, June 10-12, 2019, Athens, Greece 

© 2019 Copyright held by the owner/author(s). 

ACM ISBN 978-1-4503-6249-8/19/06. 

https://doi.org/10.1145/3331076.3331096 


a seamless deployment of micropayments below a dollar still 
remains an area of research. The information economy, dominated by the 
presence of digital goods like blog posts and digital services like 
online newspapers, has ushered in a new era of payments that involve 
very small amounts. Micropayments provide the tool to harness 
the economic benefits that can be reaped from such a market. 

Blockchain is an immutable ledger of transactions, recorded in 
a decentralized distributed database, which facilitates transactions in 
a trustless environment and brings down the costs associated with 
intermediaries in financial transactions [16]. However, scalability 
and performance issues are a bottleneck to optimum large-scale 
utilization of blockchain technology [7]. This paper is a pioneer in 
analyzing the economic impact of blockchain-based micropayment 
systems. The paper gives the relevant background and related work 
in section 2. The economic impact of blockchain-based micropayments 
is discussed in section 3, while the implications for the cybercrime 
economy are elaborated in section 4. The conclusion is given in 
section 5. 


2 BACKGROUND AND RELATED WORK 


A micropayment provider can reduce the transaction fee to facili- 
tate payments of small amounts. Apple launched the iTunes store 
in which songs are sold for 99 cents and Google Play also enabled 
micropayments as low as 10 cents per song. Both technology gi- 
ants, Apple and Google, handle these micropayments by employing 
a probabilistic model for user behaviour to pick an optimal time 
to balance credit risk versus transaction fee by batching several 
consumer purchases into one. Cryptocurrencies are digital assets 
that are used for conducting payments in blockchain platforms and 
they can be used to develop micropayment systems that enable 
payments in fractions of a cent. The divisibility property of 
cryptocurrencies aids in conducting micropayments, and a low-cost or 
nearly zero transaction fee model adopted by a blockchain platform 
can help to position it as a blockchain-based micropayment 
system. Lightning Network, Raiden and Stellar are a few examples 
of blockchain-based micropayment systems. However, the use of 
these systems is presently limited on account of technological 
issues and the absence of explicit regulations to govern the usage 
of cryptocurrencies [17], as in cases of loss of funds there is no 
redressal mechanism to compensate the users. 
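
The effect of batching several purchases into one charge, mentioned above, can be seen with a short calculation. The fee schedule used here (a fixed 30 cents plus 2.9% of the amount) is a hypothetical example chosen for illustration, not a figure from this paper or from any specific provider. 

```python
def fee_cents(amount_cents, fixed_cents=30, percent=2.9):
    """Fee in cents under a hypothetical fixed-plus-percentage schedule."""
    return fixed_cents + amount_cents * percent / 100


def fee_share(amount_cents):
    """Fraction of the charged amount that is consumed by the fee."""
    return fee_cents(amount_cents) / amount_cents


single = fee_share(99)         # one 99-cent purchase charged on its own
batched = fee_share(10 * 99)   # ten 99-cent purchases charged together
```

Under this assumed schedule roughly a third of a single 99-cent payment goes to fees, against about 6% when ten purchases are batched, which is why the fixed component dominates the economics of very small payments. 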

Chohan elaborates on the monetary role cryptocurrencies can 
play in hyperinflation [3]. Nica et al. discuss the economic benefits 
and risks of cryptocurrencies [17]. Fry and Cheah use econophysics 
models to examine shocks and crashes in cryptocurrency markets 
[9]. Li and Wang discuss the technological and economic determinants 
of the Bitcoin exchange rate [13]. Budish focuses on the economic 
limits of Bitcoin, indicating skepticism and caution about large-scale 
uses of the technology [2]. The present work deals with a 
study of the economic impact of blockchain-based micropayment 
systems, including the adverse contribution of such systems to 
cybercrime. 


3 ECONOMIC IMPACT 


Blockchain-based micropayment systems use cryptocurrencies 
to facilitate payments and as such are susceptible 
to all the risks, and provide all the benefits, that come with 
cryptocurrencies. However, being facilitators of micropayments, 
they show some additional beneficial impacts, which can aid in 
the economic progress of society. The usage of these systems 
in dark markets and cybercrime is discussed in section 4. 
The blockchain platform being used also has an impact on the 
micropayment system's economic viability. Micropayment systems 
employing proof-of-work consensus incur high energy consumption 
costs and are economically limiting [2], making the platforms 
unsustainable for long-term usage. However, the advent of blockchain 
platforms using other consensus mechanisms has provided a remedy 
to this undesirably high utilization of electricity, paving the way 
for economically viable blockchain-based micropayment systems. 


3.1 Socioeconomic Benefit 


The world can be divided into four income groups [23]. There are 
approximately 1 billion people in Level 1, who live on less than $2 
a day and lack even the basic necessities of life. Level 2 
contains 3 billion people surviving on $2 to $8 a day. Level 3 has 2 
billion people, whose needs amount to around $8 to $32 per day. 
Level 4 has 1 billion people living on more than $32 a day [23]. 
When we think of payment systems, it is clear that the mainstream 
financial organizations and payment systems do not cater to people 
in Level 1 and Level 2, who together make up more than half 
the world's present population. The few that do, like PayPal, 
have a fee structure which makes it infeasible to send amounts 
in fractions of a cent [18]. The people in Level 1 and Level 2 can 
certainly benefit from donations of less than a dollar. Thus, from a 
socioeconomic perspective, and to aid in achieving some of the 
Sustainable Development Goals of the United Nations Development 
Program [20], blockchain-based micropayment systems are a 
breakthrough in being able to transfer fractions of a cent in real time 
and at extremely low cost, if not for free. 


3.2 Revenue Generation 


The feasibility of micropayments brought about by blockchain- 
based systems can aid in further development of the market of 
microproducts [25], where products cost less than a few dollars. 
The development of e-commerce, virtual goods, online games, digi- 
tal advertising, social networks and sale of digital information has 
revamped the need for low fee-based, instant payment transfers of 
small amounts. If no intermediaries are involved in the process, 
this further helps to make this alternative market more independent. 
Stellar involves anchors, but its extremely low-cost transactions 
and use of blockchain make it a much better technology for 
micropayments than existing micropayment systems. Global mobile 
app revenue in 2016 was $88 billion and it is expected to grow to 
$189 billion by 2020 [8]. App revenues are mostly generated by 
advertising and in-app purchases that involve micropayments. After 
Apple introduced the micropayment pricing model in 2009, 
by 2017 around 50% of mobile app revenue was generated through 
in-app purchases involving amounts of the order of $0.99, $1.99 
and $2.99. These new products and markets can add to the revenue 
generation of an economy. 


3.3 Elimination of Foreign Exchange and 
Enhancement of Stability 


Cryptocurrencies like BTC, from the Bitcoin blockchain platform, are 
seen as commodity money [5], which is deemed to be comparatively 
more transparent, revealing any tampering done with it. 
Commodity money strengthens the social obligations of the issuer 
with respect to the wider society dependent on that currency [11]. 
Furthermore, commodity money has long been associated with price 
stability [6]. Consequently, Forex traders have been observed to secure 
their funds in BTC during periods of volatility to hedge against the 
instability of fiat currency. Stellar's native cryptocurrency, XLM, 
is however inflationary [24] and is designed more to integrate 
the blockchain platform into the existing fiat system, as opposed to 
Bitcoin and Ethereum, which seek to provide a more stable monetary 
system in which a commodity serves as the stabilizing anchor. In 
the absence of government regulations, cryptocurrencies have witnessed 
very high fluctuations, being used mainly for speculation and 
investment. A sound regulatory environment and a well-designed 
cryptocurrency have the potential to yield a global digital currency, 
eliminating the need for foreign exchange. Research is ongoing 
towards this end, as seen in stablecoins [21]. 


3.4 Effect on Monetary Policy 


Bitcoin was recognized as private money by the German government 
[4]. In an economic system where private money issue is 
permitted, the nature of optimal monetary policy changes significantly 
[27]. Fiat money is inconvertible and cannot be redeemed, 
whereas private money like Bitcoin can be redeemed in outside 
money. Even amidst frictions in the functioning of the private 
banking system, private money allows for the intermediation of 
investment whereas fiat currencies do not, making private money 
superior to fiat [26]. Besides, private money has the property of being 
elastic, and its quantity can respond to shocks in a way that a stock 
of fiat currency cannot [27]. For the above reasons, it 
can be envisaged that, with scalability and low transaction costs, 
private money like a cryptocurrency could be used 
in transactions involving goods, instead of fiat currency [25]. 


3.5 Financial Inclusion 


The world presently has around 1.7 billion financially excluded 
adults. According to statistics provided by FINCA International, 
76% of the poorest people in 20 countries across Africa, Eurasia, 
Latin America, the Middle East and South Asia are financially 
excluded [10]. Blockchain-based micropayment systems can give 
people in the lower income groups, Level 1 and Level 2, 
access to a means of payment for buying microproducts, as well as 
a way to store, send and receive payments of small amounts in their 
community. They can prove to be an alternative to banking services for 
the lower strata of society. An increasingly high number of people 
in the world are relying on e-payments. However, the value of a 
card payment, in nominal terms, has declined over the last decade 
and a half, from over $60 to less than $40. The smallest average 
value of a card payment was around $8 in 2016 [1], but card-based 
payments have no well-defined standard to cater to micropayments, 
leaving an unexploited avenue. Blockchain-based micropayment 
systems can fill this gap while accelerating the process of financial 
inclusion. 


4 CONTRIBUTION TO THE CYBERCRIME 
ECONOMY 


The economic definition of money implies usage of a currency as 
a unit of account, a medium of exchange and a store of value [17]. 
However, cryptocurrencies have mainly been used as a store of 
value, serving as assets for speculative purposes and as investment 
instruments. Their usage as a medium of exchange has been observed 
mainly in dark markets and with online vendors selling illegal 
goods [17]. Blockchain-based micropayment systems use cryptocurrencies 
as a medium of exchange and pose the threat of being used 
as enumerated above, with the additional risk of being used in 
micro-laundering. The cybercrime economy generates a minimum revenue 
of $1.5 trillion per annum, with illicit and illegal online markets 
contributing approximately $860 billion annually to this revenue [15]. 
Digital payment services like PayPal are being used to engage in 
micro-laundering techniques [22]. So far, cryptocurrencies account for 
only 4% of money laundered, which is equivalent to $80 billion per 
year [15], but they remain a potential medium for further growth of 
this economy. Blockchain-based micropayment systems like Lightning 
Network and Raiden, which provide the facility of conducting 
multiple micropayments without the payment transactions being 
recorded on the main blockchain platform, can only serve to further 
this economy in the future. Cybercriminals reinvest the money in 
illegal trafficking of drugs, terrorist activities and further 
cybercrime. Economic regulations and adequate Anti-Money Laundering/ 
Combating the Financing of Terrorism (AML/CFT) measures need 
to be implemented to target this growing threat posed by all digital 
payments. 


5 CONCLUSION 


In this paper, we conducted a study on the economic impact of
blockchain-based micropayment systems and highlighted the
contribution of such systems to the cybercrime economy. An analysis
of the economic impact indicates that the low-cost micropayment
model provided by blockchains can serve to reach the underbanked,
expanding the cryptocurrency user base. These systems can aid in
poverty alleviation by facilitating donations of a few dollars, with
the blockchain ensuring that the funds reach the intended
recipients. Blockchain-based micropayment systems can promote the
development of, and increase the revenue stream from, the
microproducts market. The absence of regulations has made
cryptocurrencies vulnerable to exploitation for illegitimate uses.
A study of the contribution of such digital payment mechanisms to
the cybercrime economy brings out the critical need for the
formulation and implementation of economic regulations and
preventive laws to intercept and disrupt cybercrime.


1 INTRODUCTION


In this paper, we introduce a new algorithm, EWALDIS
(Evolution- and random WALk-based algorithm for
DIScriminative patterns), for mining discriminative
patterns on the local level of dynamic attributed
multigraphs. It uses a random walk-based approach
[1] and a genetic algorithm to mine patterns that are
inexact with respect to both attributes and timestamps.
This also means that it does not require
discretization of the timestamps to be able to find
patterns. Moreover, by utilizing sampling techniques,
the algorithm does not have to traverse the
whole search space. EWALDIS is an improved version
of WalDis [7].


2 EXPERIMENTS


For the experiments, we set the probability of edge
non-selection to 0.05 and performed 1000 random walks
on each dataset. To assess EWALDIS, we trained and
evaluated a classification model on data created
from discriminative patterns. We used a k-NN classifier
with k = 3 as our model. Given a training and a test
set of events, both of size N, we first discovered
patterns by the following procedure. We repeatedly
selected n events from the training set at random and
ran EWALDIS on these events to get several patterns.
Then we assessed the presence of these patterns in
both the training and the test set.


The assessment proceeds as follows for a given pattern
and a given event:

(1) Select one graph from the pattern.

(2) Perform several random walks (10 by default)
without restarts on this pattern graph while
walking simultaneously in the tested instance.

(3) For each random walk, the score of this pattern
graph is given by the sum of similarities of the
simultaneously walked edges divided by the number
of all random-walk steps.

(4) If the random walk cannot continue in the
tested instance, it continues only in the pattern
graph and keeps counting the steps, which then
contribute no similarity.

(5) The pattern graph then receives the highest score
across all random walks. Such a score is computed for
each pattern graph, and the average is returned
as the final matching score.
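
The steps above can be sketched in a few lines of Python. This is only an illustrative reading of the procedure, not the authors' C++ implementation. The graph representation (a dictionary mapping a vertex to a list of (target, label, timestamp) edges), the start vertex name "event", the fixed walk length, the rule of following the most similar instance edge, and the function edge_similarity are all assumptions made for this sketch.

```python
import random

def edge_similarity(e1, e2, time_scale=10.0):
    """Hypothetical similarity in [0, 1]: labels must match, and the value
    decays linearly with the difference of the relative timestamps."""
    (_, label1, t1), (_, label2, t2) = e1, e2
    if label1 != label2:
        return 0.0
    return max(0.0, 1.0 - abs(t1 - t2) / time_scale)

def walk_score(pattern, instance, start, steps, rng):
    """One random walk in the pattern graph, mirrored in the tested instance."""
    p_node, i_node = start, start
    total, taken = 0.0, 0
    for _ in range(steps):
        p_edges = pattern.get(p_node, [])
        if not p_edges:
            break  # no restarts: the walk ends at a dead end of the pattern
        p_edge = rng.choice(p_edges)
        taken += 1  # every pattern step counts, matched or not (step 4)
        i_edges = instance.get(i_node, []) if i_node is not None else []
        if i_edges:
            # Assumption: follow the instance edge most similar to the pattern edge.
            i_edge = max(i_edges, key=lambda e: edge_similarity(p_edge, e))
            total += edge_similarity(p_edge, i_edge)
            i_node = i_edge[0]
        else:
            i_node = None  # the instance walk is stuck; later steps add nothing
        p_node = p_edge[0]
    # Step 3: sum of similarities divided by the number of all steps.
    return total / taken if taken else 0.0

def matching_score(pattern_graphs, instance, start="event",
                   walks=10, steps=5, seed=0):
    """Step 5: best walk per pattern graph, averaged over all pattern graphs."""
    rng = random.Random(seed)
    best_per_graph = [
        max(walk_score(g, instance, start, steps, rng) for _ in range(walks))
        for g in pattern_graphs
    ]
    return sum(best_per_graph) / len(best_per_graph)
```

In this sketch, an instance that contains the whole pattern path scores 1.0, an empty instance scores 0.0, and an instance that matches only the first of two pattern edges scores 0.5.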


By using this procedure, we compute the matching
score for both positive and negative events from both
the training and the test set. We then use the matching
scores from the training data as new features and train
a k-NN model on them. The model is then evaluated
on a test set created from the matching scores
obtained on the original test set.
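
A minimal sketch of this classification step follows, assuming that each event is described by a vector of matching scores (one per discovered pattern) and that Euclidean distance is used; the text specifies only k-NN with k = 3, so the distance metric and the function names are illustrative choices.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training vectors closest to x."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, test_X, test_y, k=3):
    """Fraction of test events whose predicted label equals the true label."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y
               for x, y in zip(test_X, test_y))
    return hits / len(test_y)
```

With two classes and k = 3 the vote can never tie, which is one practical reason to pick an odd k.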
We also created a simple baseline method for
comparison with EWALDIS. The same classification
algorithm was used, but the dataset features were
prepared differently. Specifically, we created features
from the edges adjacent to events. Each feature
denotes an edge encoded by a pair (label, relative
timestamp). For each event, a feature has the value
1 if such an edge is adjacent to the event and 0
otherwise.
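
The baseline encoding can be sketched as follows, assuming that the edges adjacent to each event are already available as a set of (label, relative timestamp) pairs; the function name and the input format are hypothetical.

```python
def baseline_features(adjacent_edges_per_event):
    """Build binary features from edges adjacent to events.

    adjacent_edges_per_event: list with one set per event, each set holding
    (label, relative_timestamp) pairs. Returns the feature vocabulary and a
    0/1 matrix with one row per event and one column per distinct pair.
    """
    vocabulary = sorted(set().union(*adjacent_edges_per_event))
    matrix = [[1 if feature in edges else 0 for feature in vocabulary]
              for edges in adjacent_edges_per_event]
    return vocabulary, matrix
```

For example, two events with adjacent edges {("coauthor", -1), ("cites", -2)} and {("coauthor", -1)} yield the vocabulary [("cites", -2), ("coauthor", -1)] and the rows [1, 1] and [0, 1]. In a real evaluation the vocabulary would presumably be built from the training events only and reused for the test events.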


3 RESULTS 


We have evaluated the algorithm on real-world graph
data such as DBLP and Enron. We show in Table 1
that the method outperforms the baseline algorithm on
all datasets and that the increase in test accuracy is
quite high, between 2.5 percentage points for NIPS vs.
KDD from the DBLP dataset and more than 30 percentage
points for the Enron dataset. A C++
implementation and the datasets, as well as the full
version of this paper, are available at
https://github.com/karelvaculik/ewaldis.


4 RELATED WORK 


Methods for discriminative pattern mining generally
assume two sets of instances, positive and negative,
and the goal is to find patterns with regard to a
defined discriminative score. Existing work typically
focuses on sets of graphs, i.e. one instance is
represented as a graph and the pattern is its subgraph.
Patterns are then used mostly for classification of
input graphs. This task differs to some extent
from ours, as our positive and negative sets consist of
vertex or edge events, and we search for patterns in
the neighborhoods of these events.

Here we only list the relevant works: [2, 5], MINDS
[4], TGMiner [9], the Waddling Random Walk algorithm
[1], and a predictive pattern miner [3]. A more extensive
overview of related work can be found in our previous
paper [7].
        Settings               Results (accuracy, %)
                             Baseline         EWALDIS
Data    Pos.     Neg.      train   test     train   test
DBLP    ICML     KDD        80.0   67.5      84.5   83.0
DBLP    KDD      ICML       74.5   64.0      80.5   79.0
DBLP    NIPS     KDD        80.0   79.0      86.0   81.5
DBLP    NIPS     ICML       63.0   53.0      74.5   65.0
ENR.    Bankr.   Bus.       87.5   57.5      95.0   90.0


Table 1. Experiment results
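
The test-accuracy gains of EWALDIS over the baseline can be read directly from Table 1. The short sketch below only recomputes them in percentage points from the tabulated values; the variable names are illustrative.

```python
# (data, positive, negative) -> (baseline test accuracy, EWALDIS test accuracy)
table1_test = {
    ("DBLP", "ICML", "KDD"): (67.5, 83.0),
    ("DBLP", "KDD", "ICML"): (64.0, 79.0),
    ("DBLP", "NIPS", "KDD"): (79.0, 81.5),
    ("DBLP", "NIPS", "ICML"): (53.0, 65.0),
    ("ENR.", "Bankr.", "Bus."): (57.5, 90.0),
}

# Gain in percentage points for each setting.
gains = {setting: ewaldis - baseline
         for setting, (baseline, ewaldis) in table1_test.items()}

for setting, gain in gains.items():
    print(" ".join(setting), f"+{gain:.1f}")
```

The smallest gain is 2.5 points (DBLP, NIPS vs. KDD) and the largest is 32.5 points (Enron).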


As a part of our research, we also created a simpler
algorithm, WalDis [7], which is based on a simple greedy
approach and does not use a genetic algorithm.
Patterns found by this algorithm are simpler and may
not capture the complexities captured by patterns found
by EWALDIS.